I. PREFACE


The  need  for  speed  accompanied  by  reliability  has  driven  many  advances  in  machine 
design.  The  history  of  computing  is  replete  with  examples — many  from  scientific 
fields — where  necessity  became  the  impetus  for  faster,  more  reliable  machinery. 
Without  exception,  history  and  past  designs  have  played  key  roles  in  the  invention 
of  new  equipment.  The  maturity  of  mechanical  calculator  design  was  foundational 
in  the  construction  of  electronic  computers.  Today’s  multiprocessor  computers  are 
extensions of uniprocessor machines and include technology developed by our telephone
industry. Many well-worn tools and lessons from the past can be applied.
Many  new  ideas  must  be  put  to  the  test.  This  thesis  is  about  applying  old  principles 
and  evaluating  new  tools  and  equipment. 

A.  A  SURVEY  OF  COMPUTING  MACHINERY 

Nothing  is  more  important  than  to  see  the  sources  of  invention,  which  are, 
in  my  opinion,  more  interesting  than  the  inventions  themselves. 

-  GOTTFRIED  WILHELM  LEIBNIZ  (1646-1716) 

1.  Beginnings 

The  history  of  mathematics  and  computing  is  as  old  as  civilization.  Tools 
like the abacus have long been used to simplify arithmetic problems. Wilhelm Schickard
(1592-1635), Blaise Pascal (1623-1662), and Gottfried Wilhelm Leibniz designed and
built mechanical, gear-driven calculators. The latest of these was essentially a four-function
calculator. By the mid-1800s, Charles Babbage had designed his Difference
Engine and proceeded to the more advanced Analytical Engine. These machines were
never completed (at least not to the grand scale that Babbage planned), but the basic
design of the Analytical Engine lies at the heart of any modern computer. Consider
his motivation.


The  following  example  was  frequently  cited  by  Charles  Babbage  (1792-1871) 
to  justify  the  construction  of  his  first  computing  machine,  the  Difference  Engine 
[Ref. 1]. In 1794 a project was begun by the French government under the direction
of  Baron  Gaspard  de  Prony  (1755-1839)  to  compute  entirely  by  hand  an  enormous 
set  of  mathematical  tables.  Among  the  tables  constructed  were  the  logarithms  of 
the  natural  numbers  from  1  to  200,000  calculated  to  19  decimal  places.  Comparable 
tables  were  constructed  for  the  natural  sines  and  tangents,  their  logarithms,  and  the 
logarithms  of  the  ratios  of  the  sines  and  tangents  to  their  arcs.  The  entire  project 
took about 2 years to complete and employed from 70 to 100 people. The mathematical
abilities of most of the people involved were limited to addition and subtraction.
A small group of skilled mathematicians provided them with their instructions. To
minimize errors, each number was calculated twice by two independent human calculators
and the results were compared. The final set of tables occupied 17 large
folio  volumes  (which  were  never  published,  however).  The  table  of  logarithms  of  the 
natural  numbers  alone  was  estimated  to  contain  about  8  million  digits. 

This quote, from Hayes [Ref. 2: p. 1], helps to explain why computers
exist and shows some of the incentive for making them better. Computing machinery
is designed for speed and reliability. A computer’s “performance” should
be measured against both of these components. Speed normally receives the most
attention. Reliability, by whatever label you choose to give it, rarely receives due
(or timely) attention. Too often, errors and issues of correctness receive careful
consideration only in reactive, not proactive, situations. Kahan says, “The Fast drives
out the Slow even if the Fast is wrong” [Ref. 3: p. 596].

The correctness side of performance is a much tougher game, and reliability
can  be  a  fairly  subjective  matter.  Often  we  pursue  solutions  that  are  “good  enough” 
(and  this  cannot  always  be  defined).  Time,  on  the  other  hand,  has  well-defined  units 
and  the  standards  for  measuring  time  enjoy  a  history  as  old  as  the  first  sunrise.  The 
ease  with  which  the  programmer  can  access  the  machine’s  clock  makes  measurements 
of  this  side  of  performance  somewhat  easier. 

Figure 1.1: Technologies and Computing Speed


Industry  demands  fast  machines  because  “time  is  money”  and  speed  alone 
can  make  difficult,  time-consuming  problems  tolerable.  Without  doubt,  the  speed 
of  a  processor  and  execution  time  are  important  performance  considerations.  But 
speed  is  partly  dependent  upon  technology.  Babbage’s  designs  represented  quite  an 
advance,  but  they  could  not  be  realized  in  his  day.  Technology  can  determine  which 
designs  succeed,  and  to  what  extent.  Figure  1.1  compares  several  recent  technologies 
using speed (measured in operations per second) as the yardstick. The data for this
illustration was taken from Hayes [Ref. 2: p. 9]. As the figure indicates, it was nearly
a century after Babbage’s work when major technological advances came about.


2.  Electricity 

Significant  gains  in  speed  were  made  possible  when  electricity  could  be  used 
in  computer  engineering.  The  United  States  census  of  1890  employed  punched  cards 
that were read using electricity and light. Herman Hollerith (1860-1929), the designer
of these cards, formed a company that would later join others and (in 1924)
take on the name International Business Machines Corporation. Punched paper tape
was later used by IBM in the Harvard Mark I, a general-purpose electromechanical
computer designed by Howard Aiken (1900-1973). In the late 1930s, at Iowa
State  University,  John  V.  Atanasoff  was  creating  a  special-purpose  machine  to  solve 
systems  of  linear  equations.  He  is  credited  with  “the  first  attempt  to  construct  an 
electronic  computer  using  vacuum  tubes”  [Ref.  2:  p.  16]. 

In 1943, J. Presper Eckert and John W. Mauchly began work at the University
of Pennsylvania on “the first widely known general-purpose electronic
computer”. The Electronic Numerical Integrator and Computer
(ENIAC) project was funded by the U. S. Army Ordnance Department. The 30-ton
machine was completed in 1946. It held more than 18,000 vacuum tubes. It could
perform a ten-digit multiplication in three milliseconds, three orders of magnitude
faster than the Harvard Mark I. [Ref. 2: pp. 17-18]

3.  First  Generation  Computers 

From  Babbage’s  Analytical  Engine  to  ENIAC,  computer  architectures  held 
data  and  programs  in  separate  memories.  In  1945,  John  von  Neumann  (1903-1957) 
proposed  the  stored-program  concept  (i.e.,  programs  and  data  could  be  stored  in 
the same memory unit). The Hungarian-born mathematician’s involvement in the
ENIAC project is not remembered by many, but the “von Neumann architecture”
has become commonplace. In fact, it “has become synonymous with any computer
of conventional design independent of its date of introduction” [Ref. 2: p. 31].
Hennessy and Patterson [Ref. 3: pp. 23-24] object to the widespread use of this
term, claiming that Eckert and Mauchly deserved more of the credit.

In 1946, von Neumann (and others) began to design such an architecture
at the Institute for Advanced Study (IAS), Princeton. This machine, now called
the  IAS  computer,  is  representative  of  so-called  first-generation  computers  (as  Hayes 
points  out:  “a  somewhat  short-sighted  view  of  computer  history”).  The  IAS  machine 
was  roughly  ten  times  faster  than  ENIAC  [Ref.  3:  p.  24].  During  the  1946-1948 
timeframe,  A.  W.  Burks,  H.  H.  Goldstine,  and  John  von  Neumann  wrote  a  series  of 
reports  describing  the  IAS  design  and  programming.  The  advances  and  refinements 
in  computer  design  that  came  out  of  this  period  were  important  and  lasting.  By 
1950,  von  Neumann  and  his  colleagues  had  formed  a  foundation  of  theory  and  design 
worthy  of  advanced  technology.  [Ref.  2:  pp.  19-20] 

4.  Transistors 

The change from vacuum tube to transistor technology marked the beginning
of the “second generation” of computers (approximately 1955-1964). Transistor
technology provided faster switching elements, but this was not the only change
of the decade. Many of the plans of the late forties and early fifties involved memory,
so it was fitting that ferrite cores and magnetic drums be used for faster main memories.
Changes such as these led Hennessy and Patterson to conclude that “cheaper
computers” were the principal new product of the early 1960s [Ref. 3: p. 26].

Additionally, machines became more sophisticated. The space and
tasks of the central processing unit (CPU) and main memories were decentralized
with the advent of special-purpose processors to augment the CPU and special-purpose
memories (e.g., registers) to augment the main memory. Finally, system
software was becoming a greater issue. Programming continued moving upward,
away from the machine level, and the processing of batch jobs was becoming more
automated. [Ref. 2: pp. 31-32]

5.  Integrated  Circuits 

The first integrated circuit (IC) was introduced in 1961 [Ref. 4: p. 1], and
the use of ICs would be among the most significant advances evident in third-generation
computers (starting about 1965). Integrated circuits brought major
changes in cost, maintenance, reliability, and the amount of real estate required.
Other than these hardware improvements (circuits and memory), third-generation
computing was not easy to distinguish from that of the second generation. There was
some migration from hardware to software (e.g., microprogramming), more specialized
and compartmentalized CPUs (e.g., pipelining), and system software continued
to advance (e.g., operating systems that could support multiprogramming through
“time-slicing”). [Ref. 2: p. 40]

6.  Instruction  Set  Trade-Offs 

A  large  part  of  designing  computer  hardware  and  software  involves  analysis 
of  cost-performance  ratios.  Other  than  genuine  advances  in  design  or  technology, 
almost  every  aspect  of  computer  architecture  involves  trade-offs.  There  is  usually 
a  spectrum  of  options  from  which  the  computer  architect  chooses,  and  the  “best” 
solutions  are  not  always  found  near  the  ends  of  the  spectrum.  Performance  can  rarely 
be  optimized  with  respect  to  both  space  and  time,  so  a  balance  must  be  sought.  This 
space-time  conflict  and  others  appear  when  a  designer  must  select  a  sophisticated 
instruction set, or a very simple one, or one of the many options along the spectrum
between the two.

In the late 1970s and early 1980s both hardware and software became progressively
more sophisticated. Instructions became longer and more complex. The
Complex Instruction Set Computer (CISC) was popular. This design has the advantage
of powerful instructions, but the machine must decode each instruction (it is
a binary code). The decoding process favors brevity because longer instructions require
more levels of decoding circuitry. Nonetheless, if the longer instructions could
carry enough meaning, the decoding endeavor would be justified.

IBM researchers uncovered a provocative statistic: 20% of the instruction
set was carrying 80% of the burden [Ref. 5: p. 5]. The instruction set had become
too complex. With some help from several researchers and IBM, the Reduced Instruction
Set Computer (RISC) architecture became popular. RISC machines admit
a smaller vocabulary, but claim quicker comprehension. In fact, the goal of the RISC
architectures is one-cycle execution of the instructions [Ref. 5: pp. 6-7]. Hennessy
and Patterson, both key contributors to the RISC movement, give an indication of
the current broad acceptance of the RISC architecture [Ref. 3: p. 190]:

Prior  to  the  RISC  architecture  movement,  the  major  trend  had  been  highly 
microcoded  architectures  aimed  at  reducing  the  semantic  gap.  DEC,  with  the  VAX, 
and  Intel,  with  the  iAPX  432,  were  among  the  leaders  in  this  approach.  In  1989, 
DEC  and  Intel  both  announced  RISC  products — the  DECstation  3100  (based  on  the 
MIPS  Computer  Systems  R2000)  and  the  Intel  i860,  a  new  RISC  microprocessor. 
With  these  announcements,  RISC  technology  has  achieved  very  broad  acceptance. 

In  1990  it  is  hard  to  find  a  computer  company  without  a  RISC  product  either 
shipping  or  in  active  development. 

Three  major  research  projects  were  central  to  early  RISC  developments.  The  first — 
the  IBM  801 — began  in  the  late  1970s,  under  the  direction  of  John  Cocke.  In  1980, 
David  Patterson  and  his  colleagues  at  the  University  of  California  at  Berkeley  began 
the  RISC-I  and  RISC-II  projects  for  which  the  architecture  is  named.  Finally,  John 
Hennessy and others at Stanford University “published a description of the MIPS
machine” in 1981. [Ref. 3: p. 189]

7. Multiprocessors and Multicomputers


The  most  recent  advances  in  the  design  of  computing  machinery  include 
parallel and concurrent architectures. The terminology associated with these machines
has been developing for about twenty-five years, but it is still immature.
The terms “multiprocessor” and “multicomputer”, for instance, are sometimes used
with additional meaning. C. Gordon Bell proposes that an MIMD machine with
message passing and no shared memory be called a multicomputer. He calls a
shared-memory MIMD machine a multiprocessor [Ref. 6: p. 1092]. This terminology
seems to be on the way to acceptance, and it seems useful in giving a general
characterization to many systems, but it lacks the sort of precision that may be
necessary.

First,  the  word  “computer”  usually  carries  many  expectations  with  it.  From 
a  computer,  we  expect  things  like  input  and  output  facilities,  peripheral  devices,  and 
so  on.  These  are  things  that  a  node  on  a  typical  “multicomputer”  does  not  always 
possess.  A  “processor”  is  just  the  opposite.  It  might  be  just  about  any  sort  of 
processor  and  we  are  cautious  about  attaching  any  expectations  to  the  term.  Many 
processors  are  special-purpose  machines,  but  (more  substantial)  central  processing 
units  and  arithmetic  logic  units  are  also  numbered  among  processors.  The  terms 
“computer”  and  “processor”  are  not  precise. 

Secondly, by automatically associating Flynn’s taxonomy, memory models
(e.g., shared, distributed), and other things with a terminology, we reduce their
importance and hide them behind the term. By using the term “multicomputer”,
without careful definition up front, we run the risk of forgetting that we are talking
about an MIMD machine that uses message passing and has no shared memory. Additionally,
this terminology, packed with expectations, ignores an entire spectrum
of very real possibilities. Are we saying that a machine cannot employ a combination
of shared and distributed memory? Using this terminology, how would we say that
the memory available to each node of a given system was 30 percent shared and 70
percent local (distributed)?

Nevertheless,  the  terms  have  some  use,  provided  we  don’t  expect  too  much 
of  them.  After  all,  we  distinguish  cars  from  trucks  in  everyday  conversation  with 
reasonably  little  confusion.  But — in  the  same  way  that  it  is  not  prudent  to  assume 
that  “car”  implies  a  vehicle  equipped  with  a  V-8  engine  and  four  doors — we  should 
be  careful  to  guard  against  packing  too  many  specifics  and  expectations  into  the 
terms “multiprocessor” and “multicomputer.” For this reason, the terms multiprocessor
and multicomputer are used almost interchangeably in this work. A conscious
effort  is  made  to  support  them  with  a  clear  description  of  the  memory  paradigm, 
communications  facilities,  and  so  on. 

Bell's  terminology  identifies  the  systems  used  in  this  work  (iPSC/2  and 
transputer networks) as multicomputers. Nevertheless, I often use the term “multiprocessor”
to identify a system with more than one processor (such as the ones
described in Chapter V and Appendix B). That is, multiprocessor means nothing
more than the expected combination of “multi” with “processor.” To forestall confusion,
the rest of the thesis pertains to distributed memory machines that use message
passing  to  communicate  instructions  and  data  between  nodes. 

8.  Uniprocessors  and  Multiprocessors 

At  the  chip  level,  multiprocessor  systems  resemble  their  single-processor 
predecessors. Experience (e.g., telephone industry, electronic technology) and a foundation
of theory and design (e.g., von Neumann’s work, network theory) are distinct
benefits  in  the  development  of  equipment  and  techniques  for  distributed  and  parallel 
computing.  From  a  system  perspective,  though,  the  concurrent  use  of  more  than  one 
processor  creates  a  fundamentally  different  environment. 

Uniprocessor systems differ substantially from multiprocessors and multicomputers
in their ability to access data without competition. In the presence of
more than one processor, regardless of memory model, there is a need to coordinate
requests for data. This means that the multicomputer must accommodate interprocessor
communications. The nodes of a multiprocessor system must work together
efficiently to justify the cost of the resulting system. Some parts of the solution are
relatively mature, but a vast territory (algorithms, electronic components, media
for communication, and software engineering techniques) begs further exploration.

B.  CURRENT  APPROACHES 
1.  Machines 

To compare the capabilities of different machines, some method of benchmarking
is typically used. By timing the execution of a certain program(s) on a given
machine we can determine its performance for the given problem. By comparing the
execution times for the same problem(s) on different machines, we arrive at a notion
of their relative power. A popular method for sizing up the computing power of
a machine is the LINPACK benchmarking program [Ref. 7]. This is essentially a
program involving the solution of a dense system of linear equations.
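
The idea of timing a dense linear solve and converting the time into a rate can be sketched in a few lines. The following Python fragment is only an illustration and is not the LINPACK code itself: the function names solve_dense and benchmark are hypothetical, the solver is a plain Gaussian elimination with partial pivoting, and the operation count 2n^3/3 + 2n^2 is the figure conventionally used to convert a LINPACK-style solve time into floating-point operations per second.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows (each a list of n floats); b is a list of n floats.
    Both are copied, so the caller's data is left untouched.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k
        # onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from the rows below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def benchmark(n, seed=0):
    """Time one solve of a random n-by-n system.

    Returns (elapsed_seconds, flops_per_second), using the conventional
    operation count 2n^3/3 + 2n^2 for the solve.
    """
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    solve_dense(a, b)
    elapsed = time.perf_counter() - start
    flops = 2.0 * n**3 / 3.0 + 2.0 * n**2
    rate = flops / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate
```

Calling benchmark(n) for several values of n on two machines, and comparing the rates, gives the kind of relative measure described above. An interpreted, single-processor sketch like this one says nothing about the absolute rates a tuned benchmark would report.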

Currently,  under  this  LINPACK  test,  the  fastest  machines  in  the  world 
have  surpassed  the  gigaflop  mark  (a  billion  floating-point  operations  per  second). 
Table 1.1, adapted from Dongarra’s report [Ref. 8: p. 21], shows performance data.
The  leftmost  column  of  this  table  gives  the  name  of  the  system  and  the  cycle  time  (in 
parentheses).  The  next  column  contains  p,  the  number  of  processors  used  to  obtain 
the  data  that  is  shown  in  the  four  remaining  columns.  For  most  systems  (e.g.,  the 
Intel  iPSC/860)  the  size  of  the  system  (number  of  processors  used  for  a  given  run) 
can  be  scaled,  so  data  was  reported  for  several  different  system  sizes. 

TABLE 1.1: WORLD’S FASTEST COMPUTERS

Computer (Clock Rate)                    p   r_max   n_max   n_1/2  r_peak
Intel Delta (40 MHz)                   512    11.9      --    7000      20
Thinking Machines CM-200 (10 MHz)     2048     9.0   28672   11264      20
Intel Delta (40 MHz)                   256     5.9      --    5000      10
Thinking Machines CM-2 (7 MHz)        2048     5.2   26624   11000      14
Intel Delta (40 MHz)                   192     4.0      --    4000     7.7
Intel Delta (40 MHz)                   128     3.0      --    3500       5
Intel iPSC/860 (40 MHz)                128     1.9      --    3000       5
nCUBE 2 (20 MHz)                      1024     1.9   21376    3193     2.4
Intel Delta (40 MHz)                    64     1.5      --    3000     2.6
nCUBE 2 (20 MHz)                       512    .958      --    2240     1.2
Intel iPSC/860 (40 MHz)                 64    .928      --    2500     2.6
Fujitsu AP1000                         512   2.251      --    2500     2.8
Intel iPSC/860 (40 MHz)                 32    .486      --    1500     1.3
nCUBE 2 (20 MHz)                       256    .482   10784    1504     .64
MasPar MP-1 (80 ns)                  16384     .44    5504    1180     .58
Fujitsu AP1000                         256   1.162   18000    1600     1.4
Intel iPSC/860 (40 MHz)                 16    .258    3000    1000     .64
nCUBE 2 (20 MHz)                       128    .242    7776    1050     .32
Fujitsu AP1000                         128    .566      --    1100     .71
Intel iPSC/860 (40 MHz)                  8    .132      --     600     .32
nCUBE 2 (20 MHz)                        64    .121    5472     701     .15
Fujitsu AP1000                          64    .291      --     648     .36
Intel iPSC/860 (40 MHz)                  4    .061      --     400     .16
nCUBE 2 (20 MHz)                        32   .0611    3888     486    .075
Intel iPSC/860 (40 MHz)                  2      --    1000     400     .08
nCUBE 2 (20 MHz)                        16      --    5580     342    .038
Intel iPSC/860 (40 MHz)                  1      --     750      --     .04
nCUBE 2 (20 MHz)                         8      --    3960     241    .019
nCUBE 2 (20 MHz)                         4      --    2760     143   .0094
nCUBE 2 (20 MHz)                         2      --    1280      94   .0047
nCUBE 2 (20 MHz)                         1      --    1280      51   .0024

The column labeled r_max gives the performance (in gigaflops) for the largest
problem run on the machine. The size of that largest problem is indicated by n_max,
where n is the dimension of the matrix of coefficients, A ∈ R^(n×n). The n_1/2 column
gives the problem size that yielded a rate of execution that was half of r_max. Finally,
r_peak denotes the theoretical peak performance (in gigaflops) for the machine.
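
To make these definitions concrete, the sketch below derives r_max, n_max, and an estimate of n_1/2 from a list of (problem size, measured rate) pairs. It is a hypothetical helper, not part of any benchmark suite: the name summarize_runs is my own, and the linear interpolation between neighboring measurements is an assumption about how one might estimate n_1/2 when no run lands exactly on half of r_max.

```python
def summarize_runs(runs):
    """Derive (r_max, n_max, n_half) from benchmark measurements.

    runs is a non-empty list of (n, rate) pairs with distinct problem sizes n
    and non-negative rates.
    r_max  : rate measured on the largest problem that was run
    n_max  : size of that largest problem
    n_half : smallest problem size at which the rate reaches r_max / 2,
             linearly interpolated between neighboring measurements
    """
    if not runs:
        raise ValueError("at least one measurement is required")
    ordered = sorted(runs)
    n_max, r_max = ordered[-1]
    target = r_max / 2.0
    prev = None
    for n, r in ordered:
        if r >= target:
            if prev is None:
                # Even the smallest run already reaches half of r_max, so
                # its size is only an upper bound on n_half.
                return r_max, n_max, float(n)
            prev_n, prev_r = prev
            # prev_r < target <= r here, so the denominator is positive.
            n_half = prev_n + (target - prev_r) * (n - prev_n) / (r - prev_r)
            return r_max, n_max, n_half
        prev = (n, r)
    raise ValueError("rates must be non-negative")
```

For example, with the made-up measurements [(100, 1.0), (300, 5.0), (800, 6.0)] the helper reports r_max = 6.0 at n_max = 800 and interpolates n_1/2 = 200. A small n_1/2 relative to n_max suggests that the machine reaches a good fraction of its best rate on modest problems.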

This data indicates that Intel is the current leader, among companies in
the United States, in the teraflop race, so we shall take a closer look at their products.
The Intel i860 microprocessor, together with 8 megabytes of memory, forms
one of 128 nodes in the hypercube-connected iPSC/860. This machine achieves performances
of nearly two gigaflops with LINPACK. iPSC stands for Intel Personal
SuperComputer, so this entry would not appear to target high-end markets. The
most significant project in supercomputing at Intel today is the Touchstone project.

George E. Brown, chairman of the U. S. House Committee on Science,
Space, and Technology, cut the ribbon around the Intel Touchstone Delta at the
California Institute of Technology on May 31, 1991 [Ref. 9: p. 96]. The Delta
is a mesh of 528 nodes. Each node holds an i860 processor and 16 megabytes of
memory. This machine has reached the 11.9 gigaflop mark with the LINPACK
benchmark.  The  closest  competitor  in  the  world  would  appear  to  be  the  CM-200 
from  Thinking  Machines,  Inc.  This  2,048-node  machine  benchmarks  at  9  gigaflops 
[Ref.  8:  p.  21].  The  Touchstone  program  is  not  over.  Intel  plans  to  follow  the  Delta 
with  the  Touchstone  Sigma.  Sigma  will  have  at  least  2,048  nodes,  each  consisting  of 
the  i860  XP  processor  (about  twice  as  powerful  as  the  i860).  [Ref.  9:  p.  96] 

The European high-performance computing market favors the transputer,
a microprocessor made by INMOS. The New York Times of May 31, 1991 lists one
German company, Parsytec, and seven American companies (Bolt, Beranek, and
Newman (BBN), Cray Research, IBM, Intel, NCube, Thinking Machines, and Tera
Computer) that have entered the teraflop race [Ref. 10]. Parsytec expects their GC
to provide “the necessary 2 to 3 orders of magnitude increase in performance above
existing supercomputers to give scientists the tool to attack their Grand Challenges.”
[Ref. 10: p. 1]

Parsytec  envisions  a  system  of  up  to  16,384  processing  elements  based  upon 
the INMOS T9000 transputer (see Chapter VII). This would give the Parsytec machine
25-megaflop nodes capable of communications bandwidths near 100 megabytes
per second. The Parsytec design begins with a cluster of seventeen T9000 processors
(sixteen primary processors and the seventeenth for backup) and four C104 wormhole
routing chips. From four clusters, the company will craft a GigaCube (or simply
Cube) of 64 processors (not counting redundant elements in the design). The GC-1
would represent a one gigaflop system and this would be the building block for
greater  systems  (lesser  systems  can  initially  be  equipped  with  16,  32,  or  48  nodes). 
The  processors  in  a  single  (Giga)Cube  are  arranged  in  a  three-dimensional  (4x4x4) 
grid.  [Ref.  10] 
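
The figures quoted for the Parsytec design can be checked with simple arithmetic. The sketch below uses only the numbers given in the paragraph above (25 megaflops per node, sixteen primary processors plus one spare per cluster, four clusters per Cube, at most 16,384 processing elements); the function name gigacube_figures and its parameter names are my own, and the results are nominal peaks obtained by multiplication, not measured rates.

```python
def gigacube_figures(node_mflops=25, primaries_per_cluster=16,
                     spares_per_cluster=1, clusters_per_cube=4,
                     max_nodes=16384):
    """Nominal sizes and peak rates implied by the quoted Parsytec figures."""
    nodes_per_cube = primaries_per_cluster * clusters_per_cube
    with_spares = (primaries_per_cluster + spares_per_cluster) * clusters_per_cube
    return {
        # Primary processors in one Cube, arranged as a 4 x 4 x 4 grid.
        "nodes_per_cube": nodes_per_cube,
        # The same Cube when the backup processors are counted as well.
        "processors_per_cube_with_spares": with_spares,
        # Nominal peak of one Cube, in gigaflops.
        "cube_peak_gflops": nodes_per_cube * node_mflops / 1000.0,
        # Number of Cubes in the largest envisioned system.
        "cubes_in_largest_system": max_nodes // nodes_per_cube,
        # Nominal peak of the largest envisioned system, in gigaflops.
        "largest_system_peak_gflops": max_nodes * node_mflops / 1000.0,
    }
```

With the default values, one Cube holds 64 primary processors (68 with spares) and has a nominal peak of 1.6 gigaflops, so the “one gigaflop” label for the GC-1 is presumably a rounded or sustained figure; that reading is my assumption. The largest system would consist of 256 Cubes with a nominal peak of about 410 gigaflops.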

2.  Programming  Practice 

Software  engineering  for  multiprocessor  systems  is  similar  to  contemporary 
practices  for  sequential  machines.  The  programming  languages  used  in  this  work 
provide normal C libraries with additional functions to accommodate interprocessor
communications. The systems typically provide a loader designed to load executable
code onto the (host and) nodes according to the programmer’s instructions. Some
loaders require that the same code be loaded onto each of the nodes. Other, more
flexible, loaders allow the user to specify which program should be loaded onto each
node. The Logical Systems C network loader, LD-NET, is such a program. It takes
a  Network  Information  File  (NIF),  describing  the  network’s  interconnections  and 
loading  instructions,  as  input  and  performs  the  loading  process. 

C. THE FUTURE


1.  Crossroads 


Parallel  and  distributed  computing  is  in  the  early  years  of  a  very  promising 
lifetime.  We  should  give  careful  consideration  to  the  direction  that  the  field  should 
assume. Lacking years of experience, I will lean on the writings and advice of others
while trying to peer a little way into the future of parallel computing. A regrettable
side  effect  of  this  decision  is  that  this  section  seems  to  consist  primarily  of  the 
observations  and  opinions  of  others.  Notwithstanding  the  many  quotations,  I  believe 
that  several  important  ideas  are  exposed. 

This  business  is  filled  with  a  combination  of  old,  established  ideas  and 
proven techniques. It also holds new questions and opportunities. Hamming’s advice
[Ref. 11: p. 14] seems most fitting in this situation:

Now I see constantly attempts to force new ideas to old molds. That is frequently
sensible: How can I make sense of what I’m seeing compared to what I did
before? But also one must ask, “Am I seeing something fundamentally new?” That
part many people will not try. You cannot afford to make everything brand new and
not connect anything together with existing ideas, nor can you try to make everything
fit into preconceived categories. Some combination of the two is necessary.

We  limped  through  the  transistor  revolution  and  the  computer  revolution, 
which  are  connected  with  the  bandwidth  revolution;  they  are  all  connected  together. . . 

You  have  to  abandon  old  ideas  when  you  get  an  order  of  magnitude  of  change.  .  .  . 

-  RICHARD  W.  HAMMING 

Developments  in  scientific  computing  today  make  Dr.  Hamming’s  thoughts 
especially timely. The field needs to establish a strategy: a direction that will lead
from  its  present  immaturity  to  a  place  of  fulfilling  its  potential.  Kenneth  Wilson 
proposes  Grand  Challenges  for  computational  science  that  may  help  to  establish  this 
strategy  [Ref.  12]. 

2. Grand Challenges


Wilson identifies three modes of scientific activity: theoretical, experimental,
and computational. He defines these areas, claiming that, with today’s
supercomputers, the most recent science (computational) is becoming more significant.
It is so significant, in fact, that “long experience or professional training is required
to be successful in computational science at the supercomputer level, making it appropriate
to think of computational science as both a separate mode of scientific
endeavor and new discipline.” [Ref. 12: p. 172]

Wilson is careful to distinguish computational science from computer science.
He defines computer science as the business of addressing “generic intellectual
challenges of the computer itself” and characterizes computational science as being
tailored to specific applications areas (with serious training in the application discipline)
[Ref. 12: p. 172]. To advance computational science, Wilson recommends a
quantitative approach with clear strategies [Ref. 12: p. 173]:

The major future opportunities for benefits of supercomputers to basic research
should be identified without the existing compromises, but presented as challenges
to be overcome with the many obstacles to success clearly explained. The
compromises and inadequacies of current computations need to be described and
the level of advances required to overcome these inadequacies discussed. Furthermore,
a few key areas with both extreme difficulties and extraordinary rewards for
success  should  be  labelled  as  the  “Grand  Challenges  of  Computational  Science”. 
Two  examples  are  electronic  structure  and  turbulence.  No  easy  promises  of  success 
in  Grand  Challenges  should  be  offered.  Instead,  computational  scientists  should  be 
building  plans  to  assault  the  Grand  Challenges,  pushing  for  the  major  advances 
in  algorithms,  software,  and  technology  that  will  be  required  for  true  progress  to 
be  achieved  in  these  areas.  The  Grand  Challenges  should  define  opportunities  to 
open  up  vast  new  domains  of  scientific  research,  domains  that  are  inaccessible  to 
traditional  experimental  or  theoretical  modes  of  investigation. 


Wilson describes a few examples that demonstrate the limitations of experimental
instrumentation and the potential of supercomputers. Weather prediction,
astronomy, materials science, molecular biology, aerodynamics, and quantum field
theory are the six areas that Wilson chooses to make his point. He describes these
areas in reasonable detail and briefly mentions other topics. [Ref. 12: pp. 175-179]

a.  Mathematical  Background 

Wilson stresses the need for sound design practices and good algorithms.
(To see why, consider Table A.1.) Additionally, he warns that we should spend less
time in awe of today’s supercomputing power and admit that it is terribly inadequate.
Modeling methods and sound mathematical background also appear in the “needs
improvement” category. Wilson [Ref. 12: p. 180] believes that

Mathematical developments that relate to numerical computation are highly
important. Theorems about numerical errors or sources of error, exact solutions
and expansions, existence and uniqueness proofs and the like, can make a major
difference in establishing the credibility of a numerical computation. All too frequently
there is too little mathematical understanding backing up numerical simulation.

b.  Issues  of  Quality 

Wilson  does  not  consider  these  to  be  the  only  problems  facing  com¬ 
putational  scientists.  He  believes  that  quality  is  endangered,  primarily  from  two 
directions  [Ref.  12:  pp.  180-181]: 

•  A  tendency  to  stay  on  the  safe,  easy  side;  not  wandering  far  from  the  position: 
“our  calculation  agrees  with  experiment.” 

•  The  quality  of  computational  programs,  measured  against  practical  criteria, 
is  lacking.  The  standards  include  rounding  errors  (e.g.,  catastrophic  cancella¬ 
tion),  overflows,  and  stability  (with  respect  to  input  parameters). 
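The catastrophic cancellation named in the second item can be made concrete with a
short C sketch. It is an illustration only and does not come from Wilson's paper; the
function names and the sample data are assumptions chosen for the example. Both
functions compute the population variance of the same data. The one-pass formula
subtracts two nearly equal large numbers, while the two-pass formula subtracts the
mean first and keeps the intermediate values small.

```c
#include <stddef.h>

/* One-pass "textbook" formula: mean of squares minus square of the mean.
 * When the data are large compared with their spread, the two terms agree
 * in almost every digit and the subtraction cancels the useful ones. */
double variance_naive(const double *x, size_t n)
{
    double sum = 0.0, sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        sum_sq += x[i] * x[i];
    }
    double mean = sum / (double)n;
    return sum_sq / (double)n - mean * mean;
}

/* Two-pass formula: compute the mean first, then sum squared deviations.
 * The deviations are small numbers, so no large cancellation occurs. */
double variance_two_pass(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    double mean = sum / (double)n;

    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        ss += d * d;
    }
    return ss / (double)n;
}
```

For the four values 1000000004, 1000000007, 1000000013, and 1000000016 the exact
population variance is 22.5. The two-pass routine recovers it, whereas the one-pass
routine works with squares near 10^18, beyond the range in which a double can resolve
units, and its result is dominated by rounding.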
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c.  Languages 


Wilson  cites  a  number  of  reasons  for  revolutions  in  computer  languages. 
In  particular,  he  believes  that  “Fortran  is  in  the  long-term  the  most  fundamental 
barrier  to  progress”  [Ref.  12:  p.  182].  His  approach  is  realistic  enough  to  recognize 
the  vast  investments  of  scientific  communities  in  Fortran.  The  language  cannot  and 
should  not  be  eliminated  in  a  day.  Nevertheless,  it  has  very  serious  shortcomings. 
Some  problems  could  be  overcome  by  a  Fortran  preprocessor  (the  same  idea  as  the  C 
preprocessor).  Other  problems,  like  lack  of  support  for  abstraction  and  the  unnatural 
exclusion  of  basic  mathematical  symbols  in  the  language,  are  not  solved  as  easily. 
[Ref. 12: p. 182]

Wilson does not recommend a simple change of language as the solution,
but searches for deeper problems. He believes that the entire way that computational
scientists and programmers think about and plan programs must change as well.
After reading Wilson’s analysis of language problems, the basic impression that
prevails is that we have an urgent need for general-purpose practices to replace
patchwork, hit-or-miss, case-by-case solutions.

3.  Generality 

David Harel is also an advocate of the need for general purpose techniques.
In the preface to his book [Ref. 13: p. viii] he warns:


Curiously,  there  appears  to  be  very  little  written  material  devoted  to  the  sci¬ 
ence  of  computing  and  aimed  at  the  technically  oriented  general  reader  as  well  as 
the  professional.  This  fact  is  doubly  curious  in  view  of  the  abundance  of  precisely 
this  kind  of  literature  in  most  other  scientific  areas,  such  as  physics,  biology,  chem¬ 
istry  and  mathematics,  not  to  mention  humanities  and  the  arts.  There  appears  to 
be  an  acute  need  for  a  technically  detailed,  expository  account  of  the  fundamen¬ 
tals  of  computer  science;  one  that  suffers  as  little  as  possible  from  the  bit/byte  or 
semicolon  syndromes  and  their  derivatives,  one  that  transcends  the  technological 
and  linguistic  whirlpool  of  specifics,  and  one  that  is  useful  both  to  a  sophisticated 
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layperson  and  to  a  computer  expert.  It  seems  that  we  have  all  been  too  busy  with 
the  revolution  to  be  bothered  with  satisfying  such  a  need. 


This idea is not unique. One of the other major proponents of general-purpose
parallel computing is David May of INMOS. In an invited lecture at the
Transputing ’91 conference [Ref. 14], he highlighted features that general-purpose
parallel hardware should deliver. Among the important components of a general
approach, May included the following:

•  Scaling.  Performance  must  scale  with  number  of  processors.  Efficiency  is 
partly  dependent  on  problem  size,  but — with  adequate  problem  size — systems 
of  a  thousand  processors  should  be  within  technological  reach.  Each  processor 
is  expected  to  achieve  10®-10®  flops. 

• Portability. This is almost synonymous with “general purpose.” May emphasizes
algorithms based upon features common to many machines, and which
remain valid as technology evolves. He stresses that this general purpose parallel
architecture will benefit both the computer designer and the programmer.
The designer will gain since the market will be somewhat predictable. The
programmer’s code will work on several machines and hold a strong hope for
working into future years.

To  achieve  these  goals,  May  proposes  several  guidelines.  First,  for  a  message  passing 
system  using  p  processors,  the  nodes  must  be  capable  of  concurrent  computing  and 
communication.  The  interconnection  topology  must  provide  scalable  throughput 
(linear  in  p)  and  bounded  delay,  probably  log(p).  Programs,  May  believes,  should  be 
written  at  as  high  a  level  as  possible  and  make  use  of  many  processes.  The  algorithm 
should  express  the  maximum  possible  parallelism.  Much  of  May’s  theory  is  based 
upon  the  structure  of  a  hypercube  interconnection  topology  (or  virtual  hypercube). 
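The logarithmic delay bound follows directly from how hypercube nodes are
labelled. The C sketch below is an illustration under the usual textbook labelling
(each of the p = 2^d nodes carries a d-bit number, and two nodes are linked when their
labels differ in exactly one bit); it is not taken from May's lecture, and the
function names are hypothetical.

```c
/* Neighbor of `node` across dimension k: flip bit k of the label. */
unsigned hypercube_neighbor(unsigned node, unsigned k)
{
    return node ^ (1u << k);
}

/* Minimum number of hops between nodes a and b: the number of bit
 * positions in which their labels differ (the Hamming distance). */
unsigned hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned hops = 0;
    while (diff != 0) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}

/* Dimension d of a hypercube with p nodes (p = 2^d), or -1 when p is
 * not a power of two and so cannot form a complete hypercube. */
int hypercube_dimension(unsigned p)
{
    if (p == 0 || (p & (p - 1)) != 0)
        return -1;
    int d = 0;
    while (p > 1) {
        p >>= 1;
        d++;
    }
    return d;
}
```

With p = 1024 the dimension is 10, so no message needs more than 10 hops, which is
log2(p). Each node also has exactly d links, one per bit of its label.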
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4.  Projections 


Kenneth Wilson makes a credible case that parallel computing is
here to stay. His reasoning is based upon the fact that mass production and heavy
competition are proven ingredients in keeping the cost of chips low. Rather than
summarize, I will quote his conclusion [Ref. 12: p. 185]:

Today a single processing unit costing millions of dollars can still be
cost-effective but I don’t think this can last very long. Over a period of time (I cannot
estimate how many years) it seems likely that the maximum price of a cost-effective
processor will plunge to one hundred thousand dollars, to ten thousand dollars, to
???. I cannot estimate the ultimate equilibrium price at which this plunge will stop.

Meanwhile  I  can  find  no  prospects  that  single  supercomputer  processors  speeds 
will  advance  at  anything  like  the  pace  at  which  processor  costs  are  being  reduced, 
even  using  Gallium  Arsenide  or  superconducting  Josephson  junctions. 

The  result  of  this  is  inevitable — overall  advances  at  the  supercomputer  level 
have to come through parallelism, namely, big increases in speed have to come from
the  simultaneous  use  of  many  processors  in  parallel. 


David May agrees with Wilson, stating that increasingly complex components
and faster clock speeds are not likely avenues of advancement. This makes
parallel processing “technically attractive.” He also agrees that mass production will
make the most effective use of design and production facilities. His conclusion: “A
general purpose parallel architecture would allow cheap, standard multiprocessors to
become pervasive.” [Ref. 14]

May’s  prediction  for  1995  includes  processors  capable  of  100  megaflops. 
INMOS  believes  strongly  in  the  idea  of  balancing  computation  and  communication, 
and  May  projects  that  node  throughputs  will  have  reached  500  megabytes  per  second. 
In  1995’s  multiprocessor  systems,  he  envisions  teraflop  performance.  By  2000,  May 
projects  “scalable  general  purpose  parallel  computers  will  cover  the  performance 
range  up  to  10"  flops.  Specialised  parallel  computers  will  extend  this  to  10’^  flops.” 
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D.  OVERVIEW 


This chapter has surveyed the (relatively recent) history of computing, considered
the state-of-the-art, and made a few guesses as to the future. Additionally, it
has introduced numerical and parallel computing. This serves as a backdrop for the
remainder  of  the  thesis.  Chapter  II  expands  the  background  on  parallel  processing 
and  numerical  methods.  The  latter  provides  a  lead-in  to  the  specific  algorithms  and 
theory  that  appear  in  Chapter  III.  Chapter  IV  introduces  the  parallel  design  and 
methods  used  in  the  work.  A  description  of  the  environment,  tools,  and  equipment 
appears  in  Chapter  V.  Results  and  conclusions  appear  in  Chapters  VI  and  VII. 

Appendices  are  provided  to  keep  the  chapters  concise  and  focused.  The  ap¬ 
pendix  material  operates  on  both  sides  of  that  focus.  Some  of  the  material  is  de¬ 
signed  to  give  sufficient  background  and  the  rest — code  mostly — is  provided  for  more 
in-depth  study.  The  background  material  may  be  obvious  to  some  readers  and  new 
to  others.  I  have  assumed  that  the  reader  has  some  knowledge  of  the  background 
material.  I  do  not  presume  that  the  reader  will  be  familiar  with  the  code. 

To  simplify  the  discussion  we  must  speak  the  same  language.  Appendix  A 
gives  the  basic  terms  and  notation  used  in  the  rest  of  the  thesis.  Next,  we  discuss 
the  machines  used  to  perform  the  work.  While  this  is  the  subject  of  Chapter  V,  a 
more  detailed  account  is  reserved  for  Appendix  B.  Appendix  C  provides  a  general 
background  on  interconnection  topologies.  Emphasis  is  placed  upon  the  hypercube 
connection  scheme.  Appendix  D  describes  the  process  whereby  a  real-world  problem 
is  translated  into  matrix  notation.  Appendix  E  gives  some  information  and  results  for 
communications  performance  in  a  hypercube.  Finally,  Appendix  F  provides  listings 
for  most  of  the  code  used  in  the  research. 
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II.  BACKGROUND 


Mathematics  is  the  door  and  key  to  the  sciences. 

—  ROGER  BACON 

Chapter  I  provided  a  backdrop,  showing  the  state  of  scientific  computing,  es¬ 
pecially  parallel  and  distributed  forms,  today.  In  the  present  chapter,  the  scope 
is  limited  to  material  and  equipment  pertaining  to  this  research.  The  thesis  work 
deals  with  methods  of  conjugate  directions  implemented  upon  two  contemporary 
MIMD  machines.  The  goal  is  to  introduce  the  theory,  machines,  methods,  and  a  few 
peripheral  issues  that  will  be  helpful  as  background  information. 

A.  COMPUTING  WITH  REAL  NUMBERS 

As illustrated in Figure 1.1, the speed of computing machinery has risen swiftly
since the 1940s. This has often been encouraged by substantial advances in technology.
Today’s multiprocessor machines seem to be maintaining the fast-paced
growth. Additionally, although precision is a less glamorous business than speed,
the accuracy of machine solutions has become more standardized. This section considers
some of the principal issues of computing with finite approximations of real numbers.

We have observed that the history of computing shows close ties to science and
mathematics. As the design and construction of computers becomes a more specialized
business (mostly performed by electrical and computer engineers), we still
find that many of the fundamental requirements are related to scientific problems.
These problems typically involve mathematics, and a significant amount of scientific
computing applies numerical methods that involve real numbers. The trend in computer
(hardware and software) design is toward abstraction, but from time to time
we absolutely must understand and work with the underlying, concrete principles.

1.  Finite-Precision 

New problems are generated as the speed of computing machinery improves
with each generation of machines. One question to be considered is, how reliable
are the machines and the software that runs on them? This is a constant concern
in computing. Many scientific problems involve continuous phenomena in the real
world. Accordingly, we would like to be able to represent the real numbers,
$\mathbb{R}$, within the machine. But, lacking infinite storage, this is impossible.
There have been several more-or-less reasonable ideas and implementations of
approximations to the real numbers within the limits of computer storage. Of these,
the floating-point concept of storage and arithmetic enjoys the most widespread use.

The Institute of Electrical and Electronics Engineers (IEEE) has established
the principal standards for floating-point representations and arithmetic. These
standards make machine arithmetic more predictable. Surprisingly, while they exist
in much of today’s computing hardware, the standards are not widely understood by
practitioners. As a result, software and applications are sometimes developed in
ignorance of them. The title of David Goldberg’s paper [Ref. 15] speaks volumes:
“What Every Computer Scientist Should Know About Floating-Point Arithmetic.”
Goldberg is also responsible for several other contributions describing floating-point
arithmetic and the IEEE standards. Appendix A of Hennessy and Patterson’s book on
architecture [Ref. 3] is such a contribution. There he gives a very useful description
of the IEEE standards and instruction on how to perform arithmetic operations on
machines that adhere to them.
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2.  IEEE  754 


Of the four precisions specified by the IEEE 754-1985 standard, this thesis
uses the double precision format most often (to approximate real numbers), so it
will receive the most attention. In the C programming language, these numbers
correspond to the type double. They are floating-point values stored in eight bytes
(64 bits). The storage representation is illustrated as three components: one sign bit,
s; an 11-bit exponent, e; and a 52-bit fraction, f. Figure 2.1 shows an example. We
say that e is a biased exponent. Both negative and positive exponents are stored using
a range of positive binary numbers biased about (nearly) the middle. Significand or
mantissa is the name given to the number (1.f). The fraction is a packed form of
the significand. This means that the leading one of the significand is implicit. This
is called a normalized number. [Ref. 16]

All IEEE floating-point numbers are normalized except for the special
representations when e = 00000000000 = 0 or e = 11111111111 = 2047. An exponent field
of zero encodes zero and the denormalized (or subnormal) numbers; an exponent field
of 2047 encodes the infinities and NaNs (not-a-number values). Only the fraction, f,
of a normalized number is stored [Ref. 3: p. A-14]. Figure 2.1 shows a representation
of the floating-point number, x = 7.0. First, x is shown as it would be defined in a C
program. The C address-of operator, &, is used to indicate the address of x in memory.
That is, somewhere (namely &x) in memory, there are eight contiguous bytes
that hold a floating-point representation of x and (for illustration purposes) we can
imagine the IEEE 754 double-precision representation of x as Figure 2.1 indicates.

A standard, such as IEEE 754 (and the lesser-known IEEE 854), is not a
panacea for the finite-precision problem but it lends tremendous support to those
who would scientifically deal with the problems of finite-precision arithmetic.
Programs given in the files num.sys.h and num.sys.c (in Appendix F) are of interest
to those who would explore further. The programs can demonstrate that the actual
order and location of bits in memory may not match the representation of
Figure 2.1. This reflects practicalities concerning storage and transmission of bytes at
a very low level in the machine. It is perfectly reasonable (and easier) to use the
common abstraction of Figure 2.1 regardless of machine implementation.

    double x = 7.0;

    s    e (11 bits)    f (52 bits)
    0    10000000001    1100000000000000000000000000000000000000000000000000

    s = 0
    e = 1025
    f = .11 (binary)

    Interpretation:

    x = (-1)^s × 1.f_2 × 2^(e-1023)
      = (-1)^0 × 1.11_2 × 2^(1025-1023)
      = 1.11_2 × 4
      = 111_2
      = 7

Figure 2.1: IEEE 754 Representation: Double Precision
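The three fields of Figure 2.1 can be recovered from a double at run time. The
following C sketch is an illustration only and is not the num.sys code of Appendix F;
the type and function names are hypothetical. It assumes that double is the 64-bit
IEEE 754 double-precision format, which the C language itself does not guarantee.
Copying the bytes into a 64-bit integer and then shifting works on the value rather
than on memory order, so the result matches the abstraction of Figure 2.1 regardless
of how the machine orders bytes.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE 754 double (assumed 64-bit layout). */
typedef struct {
    unsigned sign;      /* s: 1 bit               */
    unsigned exponent;  /* e: 11 bits, biased     */
    uint64_t fraction;  /* f: 52 bits, no leading 1 */
} double_fields;

double_fields decode_double(double x)
{
    uint64_t bits;
    double_fields d;

    /* memcpy is the well-defined way to view the bytes of a double
     * as an integer; sizeof(double) == 8 is assumed here. */
    memcpy(&bits, &x, sizeof bits);

    d.sign = (unsigned)(bits >> 63);
    d.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    d.fraction = bits & 0xFFFFFFFFFFFFFULL;   /* low 52 bits */
    return d;
}
```

For x = 7.0 the fields are s = 0, e = 1025, and a fraction whose two leading bits are
set, in agreement with Figure 2.1.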


B.  NUMERICAL  ISSUES 
1.  The  Need 


Consider the problem of determining the area under a bounded function
f(x) over a closed interval [a, b]. Numerical quadrature (integration) rules such as
the Trapezoidal Rule or Simpson’s Rule are used to arrive at an approximating (or
Riemann) sum of many smaller areas within the region. Numerical methods are
often used to approximate the solution to a problem. This is no trivial problem. To
solve it (numerically) by anything other than accident, one must first understand
the theory and analytical approach. Next, the problem can be translated into an
algorithm (a plan, usually mathematical in nature, for solving the problem
step-by-step) which can, in turn, be translated into the sort of language that a machine
understands.
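As an illustration of the last step, the composite Trapezoidal Rule can be written
in a few lines of C. This is a minimal sketch rather than code from the thesis; the
function names and the sample integrand are assumptions made for the example.

```c
/* Composite Trapezoidal Rule: approximate the integral of f over [a, b]
 * using n equal subintervals (n >= 1). Endpoints get weight 1/2,
 * interior points weight 1, and the sum is scaled by the width h. */
double trapezoid(double (*f)(double), double a, double b, int n)
{
    double h = (b - a) / (double)n;
    double sum = 0.5 * (f(a) + f(b));
    for (int i = 1; i < n; i++)
        sum += f(a + (double)i * h);
    return h * sum;
}

/* Sample integrand (hypothetical): f(x) = x^2, whose exact integral
 * over [0, 1] is 1/3. */
double square(double x)
{
    return x * x;
}
```

Calling trapezoid(square, 0.0, 1.0, 1000) gives a value slightly above 1/3. For this
convex integrand the rule overestimates, and the error shrinks in proportion to the
square of the subinterval width.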

This is a relatively simple approximation problem compared to the problem
of finding the solution to a system of 500 equations in 500 unknowns. Consider the
(perhaps more realistic) problem of using numerical linear algebra to solve an elliptic
partial differential equation like the one presented in Appendix D. Numerical
concerns abound in problems such as these. Additionally, many problems in numerical
linear algebra have time complexities of $O(n^2)$ or $O(n^3)$ and storage requirements of
$O(n^2)$, so speed is essential. (Appendix A reviews complexity notation such as
big-Oh and big-Theta.)

2.  Errors  and  Blunders 

A clear understanding of the differences between errors and blunders is
important since recognition of the source of an error is prerequisite to eliminating or
reducing it. The terms are introduced in [Ref. 17: p. 1]:

Blunders result from fallibility, errors from finitude. Blunders will not be
considered here to any extent. There are fairly obvious ways to guard against them,
and their effect, when they occur, can be gross, insignificant, or anywhere in between.
Generally the sources of error other than blunders will leave a limited range
of uncertainty, and generally this can be reduced, if necessary, by additional labor.

It  is  important  to  be  able  to  estimate  the  extent  of  the  range  of  uncertainty. 

-  ALSTON  S.  HOUSEHOLDER 
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3.  The  Issues 


To anticipate (or even troubleshoot) error we must know where it
comes from. In [Ref. 17: p. 2], Alston Householder lists the four sources of error that
were set forth by John von Neumann and Herman Goldstine:

•  Mathematical  formulations  are  seldom  exactly  descriptive  of  any  real  situation, 
but  only  of  more  or  less  idealized  models.  Perfect  gases  and  material  points  do 
not  exist. 

• Most mathematical formulations contain parameters, such as lengths, times,
masses, temperatures, etc., whose values can be had only from measurement.
Such measurements may be accurate to within 1, 0.1, or 0.01 percent, or better,
but however small the limit of error, it is not zero.

•  Many  mathematical  equations  have  solutions  that  can  be  constructed  only  in 
the  sense  that  an  infinite  process  can  be  described  whose  limit  is  the  solution 
in  question.  By  definition  the  infinite  process  cannot  be  completed.  So  one 
must  stop  with  some  term  in  the  sequence,  accepting  this  as  the  adequate 
approximation  to  the  required  solution.  This  results  in  a  type  of  error  called 
the  truncation  error. 

•  The  decimal  representation  of  a  number  is  made  by  writing  a  sequence  of  digits 
to  the  left,  and  one  to  the  right,  of  an  origin  which  is  marked  by  a  decimal 
point.  The  digits  to  the  left  of  the  decimal  point  are  finite  in  number  and 
are  understood  to  represent  coefficients  of  decreasing  powers  of  10.  In  digital 
computation  only  a  finite  number  of  these  digits  can  be  taken  account  of.  The 
error  due  to  dropping  the  others  is  called  the  round-off  error.  .  .  . 
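The last two sources, truncation error and round-off error, are the ones a
programmer meets directly, and both can be shown in a short C sketch. The example
is illustrative and is not drawn from Householder's text; the function names are
hypothetical. The first function stops an infinite series early, and the second
repeatedly adds a value that has no finite binary representation.

```c
/* Truncation error: the series e = 1/0! + 1/1! + 1/2! + ... is infinite,
 * so we must stop after `terms` terms and accept what remains as error. */
double e_series(int terms)
{
    double sum = 0.0;
    double term = 1.0;              /* holds 1/k! for the current k */
    for (int k = 0; k < terms; k++) {
        sum += term;
        term /= (double)(k + 1);
    }
    return sum;
}

/* Round-off error: 0.1 cannot be stored exactly in binary floating
 * point, so each addition works with (and produces) a rounded value. */
double add_tenths(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += 0.1;
    return sum;
}
```

With five terms the series is short of e by about 0.01, an error that more terms
reduce. Adding 0.1 ten times does not produce exactly 1.0 in IEEE double precision;
the discrepancy is tiny, but no amount of extra work inside the same format removes
it.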
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C.  MACHINE  METHODS 


We  would  like  to  somehow  characterize  the  techniques  that  make  a  problem¬ 
solving  method  “good”.  The  abilities  of  machines  and  people  are  distinct  enough  that 
we  should  not  always  expect  an  algorithm  for  machine  solution  to  mirror  the  pencil- 
and-paper  method  of  an  individual.  Hestenes  and  Stiefel  make  this  distinction,  defin¬ 
ing  a  hand  method  as  “one  in  which  a  desk  calculator  may  be  used”  and  a  machine 
method  as  “one  in  which  sequence-controlled  machines  are  used.”  [Ref.  18:  p.  409] 
Further,  in  the  same  reference,  they  list  the  following  characteristics  that  a  good 
machine  method  exhibits: 


(1)  The  method  should  be  simple,  composed  of  a  repetition  of  elementary 
routines  requiring  a  minimum  of  storage  space. 

(2)  The  method  should  insure  rapid  convergence  if  the  number  of  steps  re¬ 
quired  for  the  solution  is  infinite.  A  method  which — if  no  rounding-off  errors 
occur — will  yield  the  solution  in  a  finite  number  of  steps  is  to  be  preferred. 

(3)  The  procedure  should  be  stable  with  respect  to  rounding-off  errors.  If 
needed,  a  subroutine  should  be  available  to  insure  this  stability.  It  should  be  possible 
to  diminish  rounding-off  errors  by  a  repetition  of  the  same  routine,  starting  with 
the  previous  result  as  the  new  estimate  of  the  solution. 

(4)  Each  step  should  give  information  about  the  solution  and  should  yield  a 
new  and  better  estimate  than  the  previous  one. 

(5) As many of the original data as possible should be used during each step
of the routine. Special properties of the given linear system, such as having many
vanishing coefficients, should be preserved. (For example, in the Gauss elimination
special properties of this type may be destroyed.)

D.  CONJUGATE  DIRECTIONS 

Hestenes and Stiefel describe the method of conjugate directions (CD). This is
a general approach to solving systems of linear equations that uses direction vectors,
$p_0, p_1, \ldots$, to determine how the search for a solution should proceed from step
to step. When the method for determining these vectors is defined, CD becomes a
specific method. There are at least two of these specific methods within CD that
are especially suited to computer implementation: Gauss factorization (GF) and the
method of conjugate gradients (CG). [Ref. 18: p. 412]

The term conjugate is clearly an important one for these methods. Given a
matrix $A \in \mathbb{R}^{n \times n}$ that is symmetric, we say that two vectors x and y
are conjugate if

$$ x^{T} A y = (A x)^{T} y = 0 . \tag{2.1} $$

There is an alternative term that emphasizes the role of A in this definition. We also
say that x and y are A-orthogonal. [Ref. 18: p. 410]
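Definition (2.1) is easy to test numerically. The C sketch below is a minimal
illustration, not code from the thesis; the function names, the row-major storage
convention, and the tolerance parameter are assumptions made for the example.

```c
#include <stddef.h>

/* Compute x^T A y for an n-by-n matrix stored row-major in a[]. */
double bilinear_form(const double *a, const double *x, const double *y, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double row = 0.0;                    /* i-th entry of A y */
        for (size_t j = 0; j < n; j++)
            row += a[i * n + j] * y[j];
        total += x[i] * row;
    }
    return total;
}

/* Return 1 when x and y are conjugate (A-orthogonal) to within tol. */
int is_conjugate(const double *a, const double *x, const double *y,
                 size_t n, double tol)
{
    double v = bilinear_form(a, x, y, n);
    return (v <= tol) && (v >= -tol);
}
```

For the symmetric matrix with rows (2, 1) and (1, 3), the vectors x = (1, 0) and
y = (1, -2) satisfy (2.1) because Ax = (2, 1) and (2)(1) + (1)(-2) = 0. Their ordinary
inner product is 1, so they are A-orthogonal without being orthogonal in the usual
sense.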

The method of conjugate gradients chooses its direction vectors, $p_i$, to be mutually
conjugate ($p_i^{T} A p_j = 0$ whenever $i \neq j$) and in such a manner that $p_{i+1}$
depends upon $p_i$. (A specific formula is given near the end of Chapter III.) The Gauss
factorization chooses $p_i = e_i$, the $i$th axis vector. [Ref. 18: pp. 412, 425-427]

In  this  research,  the  Gauss  method  gets  almost  all  of  the  attention,  but  the 
method  of  conjugate  gradients  receives  a  short  overview  near  the  end  of  Chapter  III. 
The  theory  of  conjugate  directions  is  not  at  all  trivial,  and  the  ties  of  Gauss  and 
conjugate  gradients  to  conjugate  directions  are  fairly  deep.  These  issues  are  covered 
in  the  work  of  Hestenes  and  Stiefel  [Ref.  18].  This  thesis  develops  the  Gauss  method 
from  an  implementation  standpoint. 

E.  PARALLEL  PROCESSING 

The  field  of  parallel  and  distributed  computing  is  a  relatively  new  one.  In 
one  sense,  it  is  quite  natural.  We  perform  work  in  parallel  every  day.  In  fact,  a 
manager-worker  notion  is  a  very  useful  means  to  understand  the  issues  of  this  field. 
The  programs  developed  in  this  research  involve  a  host  or  manager  and  nodes  or 
workers.  This  is  often  called  the  workfarm  approach. 
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The  principal  “problem”  in  parallel  computing  is  communication.  Appendix  C 
relates  some  of  the  considerations.  Of  course,  there  are  other  concerns  as  well:  load 
balancing, problem size (granularity), and so on. These issues, as they apply to
this research, are discussed in Chapter IV.

The  bottom  line — after  all  of  the  design  and  implementation  work — is  perfor¬ 
mance.  With  multicomputers,  as  in  a  workfarm,  we  are  after  efficiency  so  that  more 
computing  can  be  done  in  a  shorter  time  and  for  less  money.  Bell  is  even  more 
specific.  He  believes  the  multicomputer  must  offer  two  key  facilities  to  become  es¬ 
tablished  [Ref.  6:  p.  1097]: 

•  Power  that  is  not  otherwise  available. 

•  Performance  for  a  price  that  is  “at  least  an  order  of  magnitude  cheaper  than 
traditional  supercomputers.” 

In  Chapter  VI,  we  consider  results  obtained  upon  two  contemporary  parallel 
machines.  This  information  helps  us  to  evaluate  the  potential  of  MIMD  architectures 
in  terms  of  Bell’s  criteria. 

F.  SPEEDUP 

The  terms  speedup  and  efficiency,  defined  in  Appendix  A,  capture  most  of  the 
interest  when  we  talk  about  the  potential  of  parallel  computing.  The  principal  reason 
for  choosing  a  multicomputer  over  a  single  computer  is  speed.  Therefore,  we  are  most 
interested  in  knowing  what  kind  of  speed  we  can  obtain  from  a  multiprocessor  system. 
Bell’s  comments  on  price  are  germane  as  well. 
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Speedup  and  efficiency  are  both  machine  dependent  and  problem  dependent. 
Some  problems  should  not  be  executed  on  a  parallel  machine!  Suppose,  for  instance, 
that  part  of  a  problem  must  be  performed  sequentially.  Amdahl’s  law  is  a  well-known 
attempt to characterize this problem. Amdahl stated that speedup on P processors,
S, is limited in the following manner:

$$ S = \frac{1}{f + (1 - f)/P} \tag{2.2} $$

where f is “the fraction of operations in a computation that must be performed
sequentially, where 0 < f < 1” [Ref. 19: p. 19]. With speedup, S, defined as
in (2.2) we see that

$$ \lim_{P \to \infty} S = \frac{1}{f} . \tag{2.3} $$

Figure 2.2 shows how this limit begins to take effect as the number of processors,
P, is increased from zero to 500. The figure is based on Amdahl’s law (2.2) with
sequential percentages, f, of 5%, 10%, and 25%.
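The values behind such a plot are simple to compute. The following C sketch
evaluates Amdahl's law (2.2) and its limiting value; it is an illustration only, and the
function names are hypothetical rather than taken from the thesis code.

```c
/* Amdahl's law: speedup on P processors when a fraction f of the
 * operations must be performed sequentially (0 < f < 1, P >= 1). */
double amdahl_speedup(double f, double P)
{
    return 1.0 / (f + (1.0 - f) / P);
}

/* Limiting speedup as the number of processors grows without bound. */
double amdahl_limit(double f)
{
    return 1.0 / f;
}
```

With f = 0.05 and P = 500 the speedup is about 19.3, already close to the limit
of 20. With f = 0.25 the speedup on 500 processors stays below 4.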

We can see that Amdahl’s law has some very discouraging news for so-called
massively parallel computing. The massive part of the term is loosely defined,
apparently meaning “many” processors. But Amdahl’s law may be based upon a faulty
assumption [Ref. 20]. Consider the following reasoning about time. Let P be the
number of processors. Let s be the time
required to execute the serial portions of a program on a serial processor and let
p be the amount of time required to complete the parallel work on the same serial
processor. Using this notation, and normalizing (s + p = 1), Amdahl’s law can be
restated

$$ S = \frac{s + p}{s + (p/P)} = \frac{1}{s + (p/P)} . $$

Then, if we consider the case P = 1,024 with s < 10%, we see in Figure 2.3 that
speedup is severely restricted.

[Figures 2.2 and 2.3: plots of speedup (vertical axis label: Speedup); images not reproduced.]
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G.  SCALED  SPEEDUP 


These  problems  with  the  usual  notion  of  speedup  led  Gustafson,  Montry,  and 
Benner  to  question  the  validity  of  Amdahl’s  assumptions  [Ref.  20:  p.  3]: 

The expression and graph are based on the implicit assumption that p is
independent of P. However, one does not generally take a fixed size problem and
run it on various numbers of processors; in practice, a scientific computing problem
scales with the available processing power. The fixed quantity is not the problem
size but rather the amount of time a user is willing to wait for an answer; when
given more computing power, the user expands the problem (more spatial variables,
for example) to use the available hardware resources.

As a first approximation, we have found that it is the parallel part of a program
that scales with the problem size. Times for program loading, serial bottlenecks,
and I/O that make up the s component of the application do not scale with
the problem size. When we double the number of processors, we double the number
of spatial variables in a physical simulation. As a first approximation, the amount
of work that can be done in parallel varies linearly with the number of processors.


Based upon this analysis, they present the notion of scaled speedup. They let
$s'$ and $p'$ represent the serial and parallel time spent on a parallel system (the inverse of
Amdahl’s method), so that $s' + p' = 1$ and a uniprocessor requires time $s' + p'P$ to
perform the task. With these definitions, they define scaled speedup, $S'$, to be

\[ S' = \frac{s' + p'P}{s' + p'} = s' + p'P = P + (1 - P)s'. \]

If we consider the same range of serial fractions as we did in Figure 2.3, we see that
scaled speedup is much better than the usual speedup. Figure 2.4 shows the plot of
scaled speedup.
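
Scaled speedup follows from the definitions above: the parallel system spends time s' + p' = 1, while a uniprocessor would need s' + p'P, so the ratio is s' + p'P. A minimal Python sketch, with the illustrative (not original) name `scaled_speedup`:

```python
def scaled_speedup(s_prime: float, P: int) -> float:
    """Scaled speedup s' + p'P, where s' + p' = 1 is measured on the parallel system."""
    if not 0.0 <= s_prime <= 1.0:
        raise ValueError("s_prime must lie in [0, 1]")
    if P < 1:
        raise ValueError("P must be at least 1")
    p_prime = 1.0 - s_prime
    # Uniprocessor time (s' + p'P) divided by parallel time (s' + p' = 1).
    return s_prime + p_prime * P


if __name__ == "__main__":
    # Serial fractions chosen for illustration only.
    for s_prime in (0.01, 0.05, 0.10):
        print(f"s' = {s_prime:.2f}: scaled speedup on 1024 processors = "
              f"{scaled_speedup(s_prime, 1024):.1f}")
```

Unlike the fixed-size bound, this quantity grows linearly with P for any fixed s'.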

Figure 2.4: Scaled Speedup

H. SUMMARY

This chapter considers the background necessary to develop the algorithms
(Chapters III and IV) and implement them (Chapter V). Algorithms are described
as sequential plans first (Chapter III). The Gauss factorization algorithm is given
in detail (Chapter III), including a discussion on the significance of pivoting. The
method of conjugate gradients receives less attention, but a brief introduction is
given near the end of Chapter III. The parallel considerations surveyed quickly in
this chapter receive more attention in Chapter IV.
III. THEORY


No  human  investigation  can  be  called  real  science  if  it  cannot  be  demonstrated 
mathematically. 

—  LEONARDO  DA  VINCI  (1452-1519) 

A.  SCOPE 

The  goal  of  this  research  is  to  demonstrate  a  parallel  method  for  solving  a 
system of linear equations. The implementation targets two contemporary MIMD
architectures: the Intel iPSC/2 and networks of INMOS transputers. There are many
methods  for  solving  linear  systems.  This  work  concentrates  primarily  upon  Gauss 
factorization  (GF),  but  the  method  of  conjugate  gradients  (CG)  is  also  introduced. 
Regrettably,  CG  is  not  developed  due  to  time  constraints  (the  derivation  is  not 
trivial).  This  does  not  imply  that  Gauss  factorization  is  superior,  nor  that  it  possesses 
greater  potential  for  parallel  solution.  Indeed,  Hestenes  and  Stiefel  preferred  CG  to 
GF  for  a  number  of  very  good  reasons  [Ref.  18:  p.  409]. 

As  we  shall  see,  the  utility  of  either  method  is  quite  dependent  upon  the  nature 
of  the  particular  problem.  Consider  the  system  of  linear  equations  represented  by 

Au  =  b.  (3.1) 

Much of the subsequent discussion applies to general, rectangular systems where
$A \in \Re^{m \times n}$. For the examples, however, square systems ($A \in \Re^{n \times n}$) are used. This
restriction greatly simplifies the discussion without losing much of the concept as
it applies to general systems. The Gauss process, i.e., the main part of the work,
excluding the stopping criteria and interpretation of the result, is the same in all
three cases ($m < n$, $m = n$, and $m > n$).

To be sure, the three cases ($m < n$, $m = n$, and $m > n$) correspond to fundamentally
different real-world systems, but the algorithms for each case are almost
identical. The restriction to a square system will greatly simplify the discussion
without blinding us to the general, rectangular case. The extensions to the general
case are well known. Golub and Van Loan [Ref. 21: p. 102] give more detail, but the
square case is most expedient for now. Square systems also simplify the experimental
procedure, data collection and analysis.

The  Gauss  method  follows  naturally  from  a  hand  method  and  it  holds  strong 
appeal to intuition. Without a pivoting strategy, however, Gauss can attempt division
by  zero.  There  is  also  a  more  subtle  issue  of  rounding  errors  within  the  limits  of 
finite-precision  arithmetic.  To  forestall  errors  of  both  kinds,  partial  and  complete 
pivoting  strategies  are  used.  This  chapter  develops  the  (sequential)  algorithms  and 
explains  the  concept  of  pivoting.  This  is  a  sensible  starting  point  for  Chapter  IV, 
where  parallel  versions  of  the  algorithms  are  given. 

B.  APPROACH 

There  are  many  methods  that  may  be  applied  to  determine  the  solution  of  a 
system  of  linear  equations.  The  methods  were  designed  for  different  reasons  and 
with  different  problems  in  mind,  so  each  exhibits  a  unique  behavior.  One  method 
is  often  preferred  over  another  for  a  given  problem.  Ultimately,  the  criterion  is 
performance,  both  in  reliability  and  speed.  The  approach  described  here  and  in  the 
remaining  chapters  seeks  to  “maximize  performance”  while  retaining  a  reasonable 
balance  of  both  efficiency  and  quality.  Speed  and  numerical  accuracy  tend  to  oppose 
one  another  so  we  are  left  to  choose  from  several  options. 

A  hand  method  introduces  each  algorithm.  The  example  is  small  and  concrete. 
Solving  a  small  problem  gives  useful  insights  into  the  algorithms.  Once  the  hand 
method  is  established,  it  is  expressed  in  an  equivalent  matrix  notation.  A  high-level 
sequential algorithm is built upon this foundation. This algorithm shows how a
machine, using a sequence of instructions, solves the problem. It also gives good
estimates for the problem’s time and storage complexities. The sequential-to-parallel
transition  involves  enough  issues  to  warrant  separate  coverage.  These  considerations 
appear  in  Chapter  IV. 

In  the  sections  that  follow,  Gaussian  elimination  is  presented  first.  It  reveals  the 
background  (sort  of  a  first  pass)  for  Gauss  factorization.  Once  the  reduction  process 
is  understood,  we  proceed  to  factorization.  A  description  of  the  method  of  conjugate 
gradients  is  given  at  the  end  of  the  chapter.  This  method,  due  to  Hestenes  and 
Stiefel,  is  based  upon  relatively  deep  theory.  Thus  the  derivations  and  background 
are  not  included.  Nevertheless,  a  synopsis  of  the  method  is  given. 

C.  APPLYING  THE  METHODS 

A  particular  method  is  often  tailored  to  a  specific  type  of  system.  The  method 
of conjugate gradients, for instance, is usually used when the matrix of coefficients,
A, is symmetric and positive definite [Ref. 18: p. 411]. The Gauss factorization
algorithm is equally important, but it takes quite another approach to solving this
system. Both CG and GF lie within the broad category of methods of conjugate
directions (Chapter II). Indeed both work in just about any case. But the better
results are obtained by using the tool that fits the task at hand.

A  very  rough  characterization  of  the  problem  can  simplify  algorithm  selection. 
We  will  look  for  two  qualities:  structure  and  density.  CG,  for  instance,  performs 
best when applied to highly structured, sparse matrices (i.e., matrices with many zero
entries). Systems like the sparse, symmetric, highly-structured result of Appendix D
deserve careful solutions that do not destroy the existing zeros. Zeros are not always
easy to come by. Gaussian elimination must expend $2n^3/3$ flops to create them.

Selecting the wrong algorithm can lead to slower execution. More importantly,
poor algorithm choice is a blunder (Chapter II). It can produce results that are accidentally
perfect, grossly incorrect, or anywhere between. Therefore, no less than
three tasks confront us:

• Characterize the problem. In systems like (3.1), attributes of the matrix of
coefficients, A, may provide a wealth of information.

•  Understand  the  algorithm(s).  Know  the  types  of  problem(s)  it  is  designed  for 
(and,  more  importantly,  know  why). 

•  Create  or  select  an  algorithm  that  suits  the  problem. 

The  sparse,  highly-structured  problems  are  not  rare!  Anyone  who  has  observed 
nature knows that many natural phenomena exhibit incredible structure and simplicity.
Strategies for solving the corresponding system should always seek to exploit
these  characteristics.  Both  sparseness  and  structure  can  reduce  storage  requirements 
and  the  number  of  flops  required.  If  we  know  the  structure  in  advance,  there  may 
be  a  smart  way  to  avoid  some  calculations  entirely  or  minimize  the  work  involved. 
(Recall  Hestenes  and  Stiefel’s  characterization  of  a  “good”  machine  method  from 
Chapter  II).  Other  problems,  when  translated  into  the  form  (3.1),  exhibit  a  dense 
matrix,  A,  with  little  or  no  apparent  structure. 

These  two  types  of  problems  should  not  be  handled  with  the  same  tools.  As 
with  many  computational  problems,  the  reasons  involve  the  use  of  time  and  space. 
We shall see that the Gauss algorithm has time complexity $O(n^3)$ and storage requirements
$O(n^2)$. (Complexity notation appears in Appendix A.) Numbers like
these grow rapidly with $n$ and, regardless of how much memory is available, the
problem can quickly overpower the computer. A naive approach to problems of
these kinds can be expensive in terms of both storage and time. This is usually
adequate incentive to take advantage of sparseness and structure whenever possible.
When it is not possible, Gauss is a good choice.
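
To give a feel for how quickly these costs grow, the sketch below tabulates the $2n^3/3$ elimination flop count quoted above against the storage for a dense n-by-n matrix. It is only a rough illustration: the 8 bytes per entry (double precision) and the name `dense_gauss_cost` are assumptions, not figures from the thesis.

```python
def dense_gauss_cost(n: int, bytes_per_entry: int = 8) -> tuple[float, int]:
    """Return (approximate elimination flops, bytes to store a dense n-by-n matrix)."""
    flops = 2.0 * n**3 / 3.0            # O(n^3) time: reduction to triangular form
    storage = bytes_per_entry * n * n   # O(n^2) storage: every entry is kept
    return flops, storage


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        flops, storage = dense_gauss_cost(n)
        print(f"n = {n:>6}: about {flops:.2e} flops, {storage / 1e6:.1f} MB")
```

Doubling n multiplies the flop count by eight and the storage by four, which is why sparseness and structure are worth exploiting when they exist.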

D.  GAUSSIAN  ELIMINATION 

Suppose  that  we  want  to  solve  a  system  of  linear  equations  using  a  systematic, 
step-by-step  method.  We  assume  that  the  system  of  linear  equations  is  given,  and 
that  the  method  must  preserve  the  original  properties  of  the  system.  That  is,  the 
method must be restricted to certain operations, namely:

• Multiply an equation by a nonzero constant.

• Interchange equations.

• Add a multiple of one equation to another.

The fact that the first two operations do not change the system’s properties is evident.
The third operation is also legitimate (maybe not quite so obviously) and is,
computationally, the most significant. Now let us apply some of these operations to
a system of four equations in the four unknowns, $v_1$, $v_2$, $v_3$, and $v_4$.

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned}
\tag{3.2}
\]

Let $m$ (= 4) be the number of equations, and let $n$ (= 4) be the number of unknowns
in each equation. Additionally, let $i$ be an equation (or row) index ($1 \le i \le m$) and
let $j$ indicate a subscript of $v$ (column index) so that $1 \le j \le n$. Finally, let $\alpha_{ij}$ be the
coefficient of $v_j$ in equation $i$ (e.g., $\alpha_{12} = 3$). Suppose that the last equation contains
only one nonzero coefficient (say $\alpha_{44}$) and the third equation has only two nonzero
coefficients ($\alpha_{33}$ and $\alpha_{34}$) and so on. This defines a triangular system (Appendix A).
The triangular system is our goal because it is easier to solve (by back substitution)
than the current (square, dense) system.
Next, observe that a triangular system would result if we could eliminate every
coefficient, $\alpha_{i1}$, of $v_1$ in all equations but the first ($i > 1$), the coefficients, $\alpha_{i2}$, of $v_2$ in
the last two equations ($i > 2$), and the coefficient, $\alpha_{43}$, of $v_3$ in the final equation. To
do this, we work by stages. At stage $k$, the coefficient, $\alpha_{kk}$, of $v_k$ in the $k$th equation
is called the pivot. This term has little significance now but is clarified later (and
it plays a very important role in the examples presented). In a particular stage, $k$,
the goal is to operate upon all equations $i$ where $i \in \{k + 1, k + 2, \ldots, m\}$ and
eliminate all coefficients, $\alpha_{ik}$, of $v_k$.

1.  A  Hand  Method 

Before attempting to describe an algorithm for a machine solution, we consider
an application of Gaussian elimination (GE) by hand. Initially, let $k = 1$. In
the example system (3.2), the first ($k = 1$) pivot is the coefficient, $\alpha_{11} = 2$, of $v_1$
in the first equation. Notice that by subtracting twice the first equation from the
second, a zero is produced under the pivot (eliminating $\alpha_{21}$). Similarly, by subtracting
the first equation from the third, a zero appears as the leading coefficient in the
third equation (eliminating $\alpha_{31}$). Finally, three times the first equation subtracted
from the fourth equation eliminates the coefficient $\alpha_{41}$. Following these steps the
altered system is:

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-5v_4 &= -5 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-v_2 - 4v_3 - 6v_4 &= -17
\end{aligned}
\tag{3.3}
\]

This  is  called  the  natural  reduction  process  [Ref.  22:  p.  72].  In  the  particular  case, 
there  are  no  changes  on  the  right-hand  side  because  the  first  equation’s  right-hand 
side  is  zero.  This  makes  for  trivial  arithmetic  on  the  right-hand  side,  but  we  should 
remember to perform the arithmetic upon whole equations (including the right-hand
side)  in  general.  The  elimination  is  even  more  successful  than  planned. 
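
The first stage of the hand method uses only the third legitimate operation (add a multiple of one equation to another). The sketch below replays those three steps on augmented rows. The helper name `add_multiple` and the list-of-lists layout are illustrative choices, not part of the thesis.

```python
def add_multiple(rows, target, source, factor):
    """Replace rows[target] with rows[target] + factor * rows[source]."""
    rows[target] = [t + factor * s for t, s in zip(rows[target], rows[source])]


# Augmented rows [coefficients of v1..v4 | right-hand side] of the example system.
system = [
    [2, 3, 4, 5, 0],
    [4, 6, 8, 5, -5],
    [2, 4, 7, 9, 13],
    [6, 8, 8, 9, -17],
]

add_multiple(system, 1, 0, -2)  # equation 2 minus twice equation 1
add_multiple(system, 2, 0, -1)  # equation 3 minus equation 1
add_multiple(system, 3, 0, -3)  # equation 4 minus three times equation 1

for row in system:
    print(row)
```

Each operation acts on the whole row, including the right-hand side, even though the right-hand side happens not to change in this example.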
The second equation already has zeros where we ultimately wanted them
in the fourth equation. That is, the system (3.3) would be closer to upper triangular
if we were to alter it by interchanging equations 2 and 4.


\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-5v_4 &= -5
\end{aligned}
\tag{3.4}
\]


The  system  (3.4)  is  called  a  row  permutation  of  (3.3).  The  ability  to  recognize 
patterns  is  a  great  advantage  that  human  problem  solvers  enjoy.  Therefore,  taking 
advantage  of  our  capabilities  we  use  a  rather  subjective  “human”  pivoting  strategy. 
But  it  is  not  fitting  to  assume  that  an  efficient  algorithm  for  a  machine  would  involve 
the  same  sort  of  pattern  recognition. 

The system (3.4) is nearly triangular. The pivot moves to the second equation
($k = 2$), and we focus on the coefficient, $\alpha_{22} = -1$, of $v_k = v_2$. By adding
the second equation to the third, the only nonzero coefficient remaining in the lower
triangle ($\alpha_{32}$) is eliminated. The resulting system becomes

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
-v_3 - 2v_4 &= -4 \\
-5v_4 &= -5
\end{aligned}
\]

The system is triangular, and it is easy to solve for the unknown values, $v_i$, by back
substitution. By inspection, $v_4 = 1$. Substituting this value into the third equation,
we find that $v_3 = 2$. Substituting both values ($v_4$ and $v_3$) into the second equation
yields $v_2 = 3$. Finally, substituting the values $v_4$, $v_3$, and $v_2$ into the first equation
gives $v_1 = -11$. The solution to the system is then

\[ \begin{bmatrix} v_1 & v_2 & v_3 & v_4 \end{bmatrix}^T = \begin{bmatrix} -11 & 3 & 2 & 1 \end{bmatrix}^T. \tag{3.5} \]
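
Back substitution on the triangular system can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis implementation: the name `back_substitute` is hypothetical, and exact `Fraction` arithmetic is used so the hand-computed values are reproduced without rounding.

```python
from fractions import Fraction


def back_substitute(U, c):
    """Solve U x = c for an upper triangular U with nonzero diagonal entries."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):  # last equation first
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (Fraction(c[i]) - known) / U[i][i]
    return x


# Triangular system reached by the hand method.
U = [
    [2, 3, 4, 5],
    [0, -1, -4, -6],
    [0, 0, -1, -2],
    [0, 0, 0, -5],
]
c = [0, -17, -4, -5]
solution = back_substitute(U, c)
print([int(v) for v in solution])
```

Each unknown is found from values already computed below it, which is why the triangular form is so much easier to solve than the dense one.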
2. A Machine Method


The  foregoing  example  illustrated  the  GE  process  as  done  on  paper.  The 
system was intentionally created for easy solution by hand calculation. That is, it uses
integers, and elimination occurs faster than in the usual case. Even this simple example
requires  a  few  minutes  to  determine  u  from  the  system  (3.2)  by  hand.  In  Chapter 
VI,  we  see  that  a  machine  can  perform  this  task  in  (much)  less  than  a  second.  For 
this  reason,  it  is  worth  examining  an  equivalent  process  to  solve  for  such  a  system 
by  machine. 

We  reenact  the  solution  from  the  beginning,  this  time  in  a  fashion  that 
a  sequence-controlled  machine  could  perform.  Until  now,  we  have  used  the  term 
“pivot”  but  have  found  no  practical  use  for  pivots.  In  this  example,  we  begin  to 
realize  the  utility  of  a  pivoting  strategy.  We  start  with  “no  pivoting”  and  shift  to 
the  “partial  pivoting”  strategy.  Additionally,  we  begin  to  use  a  more  compact  matrix 
notation.  Appendix  A  describes  the  notation  followed. 

By the method described in Appendix A, we give the linear system (3.2) a
matrix representation that corresponds to (3.1):

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}
\tag{3.7}
\]


First, we initialize a stage counter, $k$, so that $k = 1$. The pivot in stage $k$ is on
the diagonal of $A$ ($\alpha_{11} = 2$). The immediate goal is to produce zeros beneath the
pivot, in $A(2:4, 1)$. A three-step process eliminates these coefficients in row order:


•  Divide.  Divide  every  element  beneath  the  pivot  by  the  pivot  value. 


•  Update.  Perform  arithmetic  in  the  Gauss  transform  area. 


•  Eliminate.  Set  the  elements  beneath  the  pivot  equal  to  zero. 
The first step is a division. The denominator (pivot) is $\alpha_{kk} = \alpha_{11} = 2$, so
$\alpha_{21}$ becomes the multiplier $(\alpha_{21}/2) = 2$. Similarly, let $\alpha_{31} = 1$ and let $\alpha_{41} = 3$. Now

\[
A = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 2 & 6 & 8 & 5 \\ 1 & 4 & 7 & 9 \\ 3 & 8 & 8 & 9 \end{bmatrix}.
\tag{3.8}
\]


Next, consider everything below and to the right of the pivot. This is the Gauss
transform area, $G = A(k+1:m, k+1:n) = A(2:4, 2:4)$. For each element in
$G$, replace the current value, $\alpha_{ij}$, with $\alpha_{ij} - (\alpha_{ik})(\alpha_{kj})$. Do the same thing in the
corresponding rows ($i > k$) of $b$, replacing $\beta_i$ with $\beta_i - (\alpha_{ik})(\beta_k)$. We will call this
the process of performing arithmetic in (or updating) the Gauss transform area, $G$.

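Finally, when the values beneath the pivot are no longer needed, eliminate
them (set them equal to zero). The result is equivalent to the system (3.3):

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 0 & 0 & -5 \\ 0 & 1 & 3 & 4 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}
\tag{3.9}
\]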


We have finished one stage of GE. We move into the next stage, $k = 2$. This time,
when we try to update $G$ we run into a very serious problem. The first step is to
divide everything underneath the pivot by the pivot value $\alpha_{kk} = \alpha_{22} = 0$. This is
the divide-by-zero problem of a “no pivoting” strategy.
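
The divide, update, and eliminate steps of one stage, and the failure at the second stage, can be reproduced with the sketch below. The function name `eliminate_stage`, the zero-based stage index, and the use of exact `Fraction` arithmetic are assumptions made for illustration only.

```python
from fractions import Fraction


def eliminate_stage(A, b, k):
    """Run stage k (0-based) of Gaussian elimination without pivoting, in place."""
    n = len(A)
    pivot = A[k][k]
    if pivot == 0:
        raise ZeroDivisionError(f"zero pivot at stage {k + 1}")
    for i in range(k + 1, n):
        A[i][k] = Fraction(A[i][k]) / pivot   # divide: store the multiplier
    for i in range(k + 1, n):
        m = A[i][k]
        for j in range(k + 1, n):
            A[i][j] -= m * A[k][j]            # update the Gauss transform area
        b[i] -= m * b[k]                      # same arithmetic on the right-hand side
    for i in range(k + 1, n):
        A[i][k] = 0                           # eliminate: multipliers no longer needed


if __name__ == "__main__":
    A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
    b = [0, -5, 13, -17]
    eliminate_stage(A, b, 0)       # first stage succeeds
    try:
        eliminate_stage(A, b, 1)   # second stage meets a zero pivot
    except ZeroDivisionError as err:
        print("no pivoting fails:", err)
```

The explicit zero test is safe here only because the arithmetic is exact; in floating point, as the text goes on to explain, such a test is not dependable.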

During  the  execution  of  the  hand  example  we  simply  moved  the  row  to  the 
bottom  of  the  system  to  avoid  this  problem.  Now,  we  could  instruct  the  machine 
to test every element in $A(k:m, k:n)$ and interchange rows so that those with
the  most  leading  zeros  were  placed  at  the  bottom.  This  is  problematic  for  several 
reasons.  First,  it  is  not  dependable  (testing  for  equality  of  floating-point  numbers 
begs  disaster).  Secondly — even  if  we  could  identify  zeros  with  confidence — it  would 
add  a  sorting  problem  to  GE!  We  are  not  looking  for  extra  work.  The  solution  is 
partial  pivoting. 
3. Partial Pivoting


Partial pivoting is an application of row interchanges to eliminate (primarily)
the divide-by-zero problem. Consider the system of equations (3.1) with the
nonsingular matrix of coefficients, $A \in \Re^{n \times n}$ (i.e., $m = n$ and the system has exactly
one solution). Suppose further that storage and arithmetic are performed in infinite
precision. (These assumptions, infinite precision and $A$ nonsingular, are essential.)

Even  in  this  ideal  situation  Gauss  without  pivoting  is  dangerous  because, 
as  we  have  just  seen,  it  may  attempt  to  divide  by  zero.  Proper  row  permutations 
completely  eliminate  this  problem.  Partial  pivoting  will  guarantee  the  existence 
of  n  nonzero  pivots  for  A  nonsingular.  In  fact,  if  we  encounter  a  zero  pivot  with 
partial  pivoting,  it  means  that  A  is  singular  [Ref.  23].  The  remainder  of  this  section 
describes  the  partial  pivoting  strategy. 

Consider stage $k$ of the GE process with $A \in \Re^{m \times n}$. The goal is to pick
the “best” row remaining (i.e., at or below the current pivot) and install it as row
$k$, the pivot row. For reasons that are explained later, “best” shall mean the row
whose (pivot column) element is largest in magnitude. Let $s$ be the row index for the best
pivot candidate. Initially, let $s = k$ (i.e., $\alpha_{kk}$ is the first candidate). Next, we move
down the pivot column, considering all $\alpha_{ik}$ where $i > k$.

To eliminate unnecessary assignments, we replace the current candidate
with another only if $|\alpha_{ik}| > |\alpha_{sk}|$. When this occurs, we make sure that $s$ is updated
by setting it equal to $i$. After considering all elements, $\alpha_{ik}$, for $k \le i \le m$, $s$ is the
index of the “best possible” pivot row. To accomplish our goal, we must perform a row
interchange. This is easy after the new pivot row has been determined. We simply
swap rows $k$ and $s$ (if $k \ne s$). Within the assumptions above, we have completely
eliminated the potential for division by zero. Now let us return to the problem at
hand.
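
The pivot search just described can be sketched as a short Python function. The name `partial_pivot_row` and the zero-based indices are illustrative assumptions; the strict comparison mirrors the rule of replacing the candidate only when a strictly larger magnitude is found.

```python
def partial_pivot_row(A, k):
    """Return the 0-based row index s >= k whose column-k entry is largest in magnitude."""
    s = k  # the diagonal element is the first candidate
    for i in range(k + 1, len(A)):
        if abs(A[i][k]) > abs(A[s][k]):  # strict test avoids unnecessary assignments
            s = i
    return s


if __name__ == "__main__":
    # Coefficient matrix after the first elimination stage of the example.
    A = [[2, 3, 4, 5], [0, 0, 0, -5], [0, 1, 3, 4], [0, -1, -4, -6]]
    s = partial_pivot_row(A, 1)
    print("pivot row (1-based):", s + 1)
    if s != 1:
        A[1], A[s] = A[s], A[1]  # the row interchange itself is a simple swap
```

On a tie in magnitude the earlier row is kept, so the search never swaps when the current pivot is already as good as any candidate below it.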
4. A Machine Method (Resumed)


Applying partial pivoting to the system (3.9), we find that the next pivot
is located at $A(3, 2)$, so we must interchange rows (equations) two and three. Before
performing this step, however, let us create a vector to keep track of the row
permutations. Let $q \in \Re^m$ be the row permutation vector. We initialize $q$ so that

\[ q = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}^T \tag{3.10} \]

and perform row interchanges in $q$ corresponding to those in $A$ so that the $i$th element of $q$ is always
the original equation number for current equation number $i$. Thus, after performing
the row interchange, we have

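\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 13 \\ -5 \\ -17 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}.
\tag{3.11}
\]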


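Notice that the third element of $q$, which is 2, indicates that the third equation in (3.11)
was the second equation in the original system (3.7). Now, since $\alpha_{32} = 0$, no arithmetic
is required in the third row. In row four, the arithmetic will be equivalent to the notion
of adding (the current) equation two to equation four. The result is

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & 0 & -1 & -2 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 13 \\ -5 \\ -4 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}.
\tag{3.12}
\]
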
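When we move the pivot index to the third equation ($k = 3$), we notice that $\alpha_{33} = 0$.
The divide-by-zero problem has resurfaced. Once again, we pivot, swapping rows
three and four. After this, we have

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & -1 & -2 \\ 0 & 0 & 0 & -5 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 13 \\ -4 \\ -5 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 4 \\ 2 \end{bmatrix}.
\tag{3.13}
\]
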
The zero beneath the final pivot obviates the need for further arithmetic. The triangular
system (3.13), found by our machine method, does not look like the system (3.5)
from the hand method because we did not perform the same row interchanges. If we
had maintained a row permutation vector, $q$, for the hand method we would have
noticed that

\[ q = \begin{bmatrix} 1 & 4 & 3 & 2 \end{bmatrix}^T. \tag{3.14} \]

Of course, back substitution for the final (triangular) machine system (3.13) yields
the same solution

\[ \begin{bmatrix} v_1 & v_2 & v_3 & v_4 \end{bmatrix}^T = \begin{bmatrix} -11 & 3 & 2 & 1 \end{bmatrix}^T \tag{3.15} \]


as  that  of  the  hand  method.  Thus,  even  though  we  used  different  permutation 
schemes,  the  “pivots”  in  both  cases  were  always  nonzero  and  the  solutions  were  the 
same.  This  is  not  surprising,  since  A  is  nonsingular  and  row  permutation  is  merely 
the  practice  of  interchanging  equations. 
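
The whole machine method (pivot search, row interchange with bookkeeping in q, elimination, and back substitution) can be put together as in the sketch below. It is a sketch under stated assumptions rather than the thesis code: the name `gauss_solve` is hypothetical, exact `Fraction` arithmetic stands in for the infinite-precision assumption, and partial pivoting is applied at every stage, including the first, so the row order differs from the walkthrough above, which began pivoting only at the second stage.

```python
from fractions import Fraction


def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution.

    Returns (x, q); q[i] is the original (1-based) number of the equation
    that ends up in row i.
    """
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]  # exact working copies
    b = [Fraction(v) for v in b]
    q = list(range(1, n + 1))
    for k in range(n - 1):
        # max() keeps the first of several equal candidates, like the strict test.
        s = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[s][k] == 0:
            raise ValueError("matrix is singular")
        if s != k:
            A[k], A[s] = A[s], A[k]
            b[k], b[s] = b[s], b[k]
            q[k], q[s] = q[s], q[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    if A[n - 1][n - 1] == 0:
        raise ValueError("matrix is singular")
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        known = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / A[i][i]
    return x, q


if __name__ == "__main__":
    A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
    b = [0, -5, 13, -17]
    x, q = gauss_solve(A, b)
    print("solution:", [int(v) for v in x], "row order:", q)
```

Whatever permutation the pivoting produces, the solution is the same, because row permutation is only a reordering of the equations.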

Let  us  review  first  the  process  and  then  the  theory  of  Gaussian  elimination. 
The  GE  process  performs  a  systematic  elimination  of  the  lower  (in  our  example) 
triangle  of  a  matrix  of  coefficients,  A.  Arithmetic  operations  are  performed  upon 
entire  equations  at  the  same  time  (including  the  right-hand  side,  b).  In  other  words, 
during stage $k$ of the process, arithmetic operations are performed upon (portions of)
all rows $i$ ($i > k$) of $A$ and upon all elements (rows) $\beta_i$ (for $i > k$) of the right-hand
side, $b$. The process depends upon both $A$ and $b$ and both of them can be changed
substantially.

The  idea  behind  Gaussian  elimination  is  that  general  square  systems  are 
difficult  to  solve,  but  triangular  systems  are  easy.  The  goal  is  to  transform  a  general 
matrix $A$ into triangular form, performing legitimate arithmetic upon entire equations
(including the right-hand sides). Reduction to triangular form costs $2n^3/3$
flops. Once $A$ is reduced to triangular form, back substitution yields a solution for
the unknown, $u$, in $n^2$ flops. Thus GE solves a general, dense, square system of $n$
equations in $n$ unknowns by the application of $2n^3/3 + n^2$ flops. [Ref. 21: pp. 88, 97]

E.  GAUSS  FACTORIZATION 

Gauss  factorization  (GF)  is  a  well-known  method  for  solving  linear  systems 
like  (3.1)  that  (simultaneously)  factors  A.  GF  has  strong  ties  to  the  GE  process. 
Those  ties  will  become  evident  as  we  develop  the  same  example  over  again,  this  time 
using  the  GF  bookkeeping  and  method.  GF  holds  several  major  advantages  over  GE. 
Among these: A is recoverable (the process does not destroy it) and the process is
independent  of  the  right-hand  side,  b.  In  fact,  b  is  not  used  in  the  factoring  process. 

1.  Complete  Pivoting 

The  complete  pivoting  strategy  will  be  applied  in  this  example.  There  is  no 
special  significance  behind  the  introduction  of  complete  pivoting  with  the  GF  process. 
Either strategy can be used with GE or GF. (A “no pivoting” strategy is also available,
but it is not generally acceptable for serious problems.) The complete
strategy is a straightforward extension of the partial strategy, so introducing partial
pivoting first was practical.

With complete pivoting, row interchanges are still allowed, but so are column
interchanges. We will continue to use $q \in \Re^m$ for row interchange bookkeeping.
The vector $p \in \Re^n$, similarly, will maintain the column permutation information. We
search not just the pivot column, but the entire Gauss transform area, for the next
pivot. This takes longer but generally produces better solutions. The numerical differences
between partial and complete pivoting involve some difficult error analysis.
These issues will be addressed briefly after we complete the examples.
2. Example


Now the GF process is demonstrated. We start with the same system of
four equations in four unknowns:

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned}
\tag{3.16}
\]



and proceed immediately to the matrix of coefficients (the factoring part of GF
concerns itself with $A$ only).

\[
A = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\tag{3.17}
\]



a.  Stage  Zero 

For the initial stage, $k = 0$, let the Gauss transform area be $G = A$.
Also initialize pivot indices $s = t = 1$. The sole purpose of stage zero is to find the
first pivot. Initially, we guess that the pivot is $\alpha_{11}$, located at $A(1, 1)$, the upper
left-hand corner of $G$. (This is the position where the new pivot will be installed.)
Accordingly, we set row and column indices, $s = 1$ and $t = 1$, to keep track of the
best pivot candidate.

Indices  s  and  t  are  changed  only  when  we  find  a  superior  candidate  for 
the  pivot.  To  begin  the  column-by-column  search  for  the  pivot  we  move  down  the 
columns  in  order  from  left  to  right  and  through  each  column  in  a  top-to-bottom 
manner.  When  we  have  considered  every  element  in  G,  we  know  that  the  next  pivot 
is currently situated at $A(s, t)$.

For the current example, as we move down the first column of $G$, the
values of $s$ and $t$ are adjusted twice. A better pivot candidate is found, first at $A(2, 1)$,
and next at $A(4, 1)$. The indices are adjusted again in the last row of column two,
where the value, 8, is larger than the value of the current candidate, 6. Column
three has no candidates larger than 8, so we do not adjust the indices again until we
find the 9 at $A(3, 4)$. Thus $s = 3$ and $t = 4$ have located the next pivot according
to a complete pivoting strategy. This accomplishes the goal of stage zero. Now we
specify the process for each of the remaining stages.
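
The stage-zero search can be sketched as follows. The function name `complete_pivot` and the zero-based indices are assumptions for illustration. The scan order (columns left to right, each column top to bottom) and the strict comparison follow the description above.

```python
def complete_pivot(A, k):
    """Return 0-based (s, t) of the largest-magnitude entry in A[k:, k:].

    The candidate changes only when a strictly larger magnitude is found.
    """
    m, n = len(A), len(A[0])
    s, t = k, k  # first guess: upper left-hand corner of the transform area
    for j in range(k, n):        # columns, left to right
        for i in range(k, m):    # each column, top to bottom
            if abs(A[i][j]) > abs(A[s][t]):
                s, t = i, j
    return s, t


if __name__ == "__main__":
    A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
    s, t = complete_pivot(A, 0)
    print(f"first pivot at A({s + 1}, {t + 1}) = {A[s][t]}")
```

Because of the strict comparison, the first 9 met in the last column (row three) is chosen and the second 9 (row four) is ignored.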

b.  Outline  of  the  GF  Process 

For  each  stage,  k,  of  GF,  we  shall  perform  the  following  steps: 

•  Locate  the  pivot  according  to  a  pivoting  strategy  (none,  partial,  or  complete). 
If  complete  pivoting  is  used,  search  all  of  G  for  the  next  pivot. 

•  Increment  the  pivot  index,  k. 

•  Perform  any  row  and/or  column  permutations  that  are  required  to  move  the 
pivot  into  the  position  A{k,k).  Update  p  and  q  accordingly. 

•  Divide  every  element  beneath  the  pivot  by  the  pivot  value. 

•  Redefine the Gauss transform area so that G = A((k + 1):m, (k + 1):n).

•  Perform  the  appropriate  arithmetic  in  G. 
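The steps above can be sketched as a single Python function. This is a minimal sketch under stated assumptions: `gf_stage` is a hypothetical name, indices are 0-based, the pivot location (s, t) is supplied by the caller, the bookkeeping for p and q is omitted, and the starting matrix is the example matrix inferred from (3.18). Exact rational arithmetic is used so the fractions can be compared with the worked example.

```python
from fractions import Fraction


def gf_stage(A, k, s, t):
    """Apply one GF stage in place: install the pivot found at (s, t) at (k, k),
    divide beneath the pivot, and update the Gauss transform area G."""
    m, n = len(A), len(A[0])
    A[k], A[s] = A[s], A[k]                      # row interchange
    for row in A:                                # column interchange
        row[k], row[t] = row[t], row[k]
    pivot = A[k][k]
    for i in range(k + 1, m):
        A[i][k] /= pivot                         # multiplier, stored in place
        for j in range(k + 1, n):
            A[i][j] -= A[i][k] * A[k][j]         # arithmetic in G
    return A


A = [[Fraction(v) for v in row] for row in
     [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]]
gf_stage(A, k=0, s=2, t=3)                       # stage one: pivot 9 found at A(3,4)
for row in A:
    print([str(v) for v in row])
```

With these inputs the rows printed should agree with the stage-one result of the worked example.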

Let  us  return  to  the  example  and  exercise  the  process. 

c.  Stage  One 

Since stage zero has already located the first pivot, the first step of
section b is not necessary in this stage. We increment k (to k = 1) and install the
pivot A(3,4) at A(k,k) = A(1,1). This means that rows 1 and 3 must be swapped.
Columns 1 and 4 must be swapped as well. The permutation vectors, p and q,
record the interchanges.
After interchanging rows and columns, we have

\[
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5 & 6 & 8 & 4 \\ 5 & 3 & 4 & 2 \\ 9 & 8 & 8 & 6 \end{bmatrix} \tag{3.18}
\]


Now we perform the division beneath the pivot, producing the multipliers in the
lower three rows in the leftmost column of A. When this is done, we perform the
arithmetic in G = A((k + 1):m, (k + 1):n) = A(2:4, 2:4). For GF, we do not
replace the multipliers with zeros. We shall find that the multipliers are very useful
in the end. The result is

\[
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5/9 & 34/9 & 37/9 & 26/9 \\ 5/9 & 7/9 & 1/9 & 8/9 \\ 1 & 4 & 1 & 4 \end{bmatrix} \tag{3.19}
\]


Next, with G being the lower right (3 x 3) block of A, we search G for the next pivot
and find that A(s,t) = A(2,3) holds (37/9), the largest second pivot candidate.

d.  Stage  Two 

We increment the stage counter (k = 2), so that it points to the new
pivot location, A(2,2). Since s = k, we know that no row interchange is necessary
and q will not change. We must, however, swap columns k = 2 and t = 3. The result
is

\[
A = \begin{bmatrix} 9 & 7 & 4 & 2 \\ 5/9 & 37/9 & 34/9 & 26/9 \\ 5/9 & 1/9 & 7/9 & 8/9 \\ 1 & 1 & 4 & 4 \end{bmatrix} \tag{3.20}
\]


Once again, we divide everything under the pivot by the value of the pivot and
update G. This yields

\[
A = \begin{bmatrix} 9 & 7 & 4 & 2 \\ 5/9 & 37/9 & 34/9 & 26/9 \\ 5/9 & 1/37 & 25/37 & 30/37 \\ 1 & 9/37 & 114/37 & 122/37 \end{bmatrix} \tag{3.21}
\]
e.  Stage Three

Now G becomes the (2 x 2) lower right block of A and the next pivot
(122/37) is found at A(s,t) = A(4,4). Since k = 3 we must interchange rows 3 and
4 as well as columns 3 and 4. The result of the permutation is

\[
A = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5/9 & 37/9 & 26/9 & 34/9 \\ 1 & 9/37 & 122/37 & 114/37 \\ 5/9 & 1/37 & 30/37 & 25/37 \end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix} \tag{3.22}
\]


Then, dividing at the bottom of the pivot column and updating G, we have

\[
A = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5/9 & 37/9 & 26/9 & 34/9 \\ 1 & 9/37 & 122/37 & 114/37 \\ 5/9 & 1/37 & 15/61 & -15/183 \end{bmatrix} \tag{3.23}
\]


f.  Stage Four

The final stage, where k = 4 = min(m,n), is always trivial. We need
only to verify that $\alpha_{44}$ is nonzero. This tells us that, indeed, A is nonsingular. There
is no arithmetic to perform, so (3.23) is the final, factored, copy of A.


g.  Summary 

Using the Gauss factorization process we have systematically transformed
the matrix $A \in \mathbb{R}^{4 \times 4}$ into a form that factors the original version of A. At
this point the factorization itself has not been discussed, only the process whereby
we claim to have factored A. Before we explore the resulting factorization, let us
consider, in a general way, what happens in any stage, k, of GF.

3.  One  Stage  of  Gauss  Factorization 

The  most  important  part  of  GF  is  the  factorization  that  it  produces. 
The GF process is reversible (pivots and other key information become part of the
factorization). This section, using block matrix notation and induction on the stage
number, illustrates the effect of one stage of GF. The proof shows that we can
perform  an  n-step  Gauss  factorization  A  =  LR,  with  L  unit  lower  triangular  and  R 
right  (upper)  triangular  with  nonzero  diagonal  elements.  Before  the  proof,  however, 
let  us  consider  a  concrete  illustration  where  n  =  15. 

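Let $\otimes$ denote those elements that Gauss has fixed in both value and position.
The $\times$ symbol marks elements that are subject to permutations but not changes in
value. Those elements that are subject to both permutation and changes in value
are indicated by the $\odot$ symbol. Elements in the pivot row are marked with the $\ominus$
symbol and the symbol $\oslash$ denotes elements beneath the pivot. White space indicates
zeros, $\alpha$ is the pivot, and any $\rho_i$ was a former pivot (in stage i). Let k = 7. Then
the leftmost 7 columns of $R_7$ are already fixed in upper triangular form and $L_7$ is
unit lower triangular with the special form described above. Upon entering stage
(k + 1) = 8 of the Gauss factorization process, the matrices $L_7$ and $R_7$ would appear
as shown below:

\[
L_7 = \left[\begin{array}{*{15}{c}}
1 \\
\otimes & 1 \\
\otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & 1 \\
\times & \times & \times & \times & \times & \times & \times & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & & & & 1
\end{array}\right] \tag{3.24}
\]

\[
R_7 = \left[\begin{array}{*{15}{c}}
\rho_1 & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & \rho_2 & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & \rho_3 & \otimes & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & \rho_4 & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & \rho_5 & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & & \rho_6 & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & & & \rho_7 & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & & & & \alpha & \ominus & \ominus & \ominus & \ominus & \ominus & \ominus & \ominus \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot
\end{array}\right] \tag{3.25}
\]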

With this illustration in mind, let us prove the effect of GF.


Proposition: Given $A \in \mathbb{R}^{n \times n}$. Let $L_i \in \mathbb{R}^{n \times n}$ be the unit lower triangular matrix
with $I_{n-i}$, the (n - i) x (n - i) identity, as its lower, right-hand block. Let $R_i \in \mathbb{R}^{n \times n}$
be the matrix that is upper (right) triangular in its leftmost i columns. Initially, let
$A = L_0 R_0$ with $L_0 = I$ and $R_0 = A$. Let P(k) be the proposition: “Stage k of the
Gauss factorization process yields the factorization, $A = L_k R_k$.”

To Show: $P(k) \Rightarrow P(k + 1)$ for $0 \le k \le (n - 1)$.

Assumptions: Pivoting, according to any valid strategy, is performed outside of
this factorization procedure and the pivoting strategy yields pivots, $\alpha \ne 0$.
Notation: We can partition A so that

\[
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix} \tag{3.26}
\]

where $\alpha \in \mathbb{R}$ is the initial pivot, $x \in \mathbb{R}^{n-1}$ holds the values beneath the pivot,
$y \in \mathbb{R}^{n-1}$ holds the values of the elements in the pivot row to the right of the pivot,
and $G \in \mathbb{R}^{(n-1) \times (n-1)}$ is the Gauss transform area.


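Basis for Induction: We must show that $P(0) \Rightarrow P(1)$. P(0) means that $L_0 = I_n$
and $R_0 = A$. That is, $R_0$ has no special structure except (by assumption) we are
guaranteed a nonzero pivot $\alpha$. Consider stage k = 1 of Gauss factorization. Let us
partition A as above and factor

\[
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix}
  = \begin{bmatrix} 1 & 0^T \\ \ell & I \end{bmatrix}
    \begin{bmatrix} \rho & r^T \\ 0 & B \end{bmatrix}
  = L_1 R_1 \tag{3.27}
\]

where B, r, $\ell$, and $\rho$ (with the obvious sizes) are defined as

\[ \rho = \alpha \tag{3.28} \]
\[ r = y \tag{3.29} \]
\[ \ell = x / \alpha \tag{3.30} \]
\[ B = G - \ell r^T \tag{3.31} \]

Thus, given $A = L_0 R_0$, Gauss factors $A = L_1 R_1$ and $P(0) \Rightarrow P(1)$.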
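The basis step can be checked numerically on the worked example. The sketch below is illustrative only: the helper names `first_step` and `rebuild` are hypothetical, and the input is the matrix of (3.18), taken after the stage-one interchanges so that the pivot already sits in the upper left-hand corner. It forms the multipliers x/alpha and the block G minus the outer product of the multipliers with the pivot row, then multiplies the two block factors back together.

```python
from fractions import Fraction


def first_step(A):
    """Split A = [[alpha, y^T], [x, G]] and return (l, r, rho, B)."""
    alpha = A[0][0]
    y = A[0][1:]                                   # pivot row, right of the pivot
    x = [row[0] for row in A[1:]]                  # values beneath the pivot
    G = [row[1:] for row in A[1:]]                 # Gauss transform area
    l = [xi / alpha for xi in x]                   # multipliers
    B = [[G[i][j] - l[i] * y[j] for j in range(len(y))] for i in range(len(x))]
    return l, y, alpha, B


def rebuild(l, r, rho, B):
    """Multiply L1 = [[1, 0], [l, I]] by R1 = [[rho, r^T], [0, B]]."""
    top = [rho] + list(r)
    rest = [[l[i] * rho] + [l[i] * r[j] + B[i][j] for j in range(len(r))]
            for i in range(len(l))]
    return [top] + rest


A = [[Fraction(v) for v in row] for row in
     [[9, 4, 7, 2], [5, 6, 8, 4], [5, 3, 4, 2], [9, 8, 8, 6]]]
l, r, rho, B = first_step(A)
print(rebuild(l, r, rho, B) == A)
```

The multipliers and the block B produced here are the same numbers that appear beneath and to the right of the first pivot in (3.19).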


Inductive Step: Consider the matrices $L_k$ and $R_k$ that are submitted to stage
(k + 1) of a Gauss factorization procedure. We make the inductive step to show that
$P(k) \Rightarrow P(k + 1)$. For $0 \le k < n$, $A = L_k R_k$ may be partitioned so that

\[
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0 \\ N & 0 & I \end{bmatrix}
    \begin{bmatrix} R & s & T \\ 0^T & \alpha & y^T \\ 0 & x & G \end{bmatrix} \tag{3.32}
\]

where $L \in \mathbb{R}^{k \times k}$ is a unit lower triangular matrix and $R \in \mathbb{R}^{k \times k}$ is a right (upper)
triangular matrix with nonzero diagonal elements.

The Gauss process forms $\rho$ as in (3.28), r as in (3.29), multipliers, $\ell$, as
in (3.30), and B as in (3.31). Then, for $0 \le k \le (n - 1)$, GF forms

\[
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0 \\ N & \ell & I \end{bmatrix}
    \begin{bmatrix} R & s & T \\ 0^T & \rho & r^T \\ 0 & 0 & B \end{bmatrix}
  = L_{k+1} R_{k+1} \tag{3.33}
\]

Thus, for $0 \le k < n$, $P(k) \Rightarrow P(k + 1)$. [Ref. 24]

Conclusion: The nonsingular matrix $A \in \mathbb{R}^{n \times n}$ can be factored, in n steps of the
Gauss factorization process, so that A = LR with L being unit lower triangular and
R being upper triangular with nonzero diagonal elements.

The proof has demonstrated the effect of GF. For simplicity, it excluded
the pivoting strategy (simply assuming that, at every stage, a pivot $\alpha \ne 0$ would be
available). It also held A square. In this sense the proof is somewhat specific. There
is a more general conclusion to be made. This conclusion holds for GF with pivoting
and $0 \ne A \in \mathbb{R}^{m \times n}$, and it is absolutely essential to understanding the factorization.

4.  The  LR  Theorem 

With the GF process complete, and the vast majority of the work done,
we  show  how  to  form  a  solution  from  our  factorization.  Various  methods  of  pivoting 
(resulting  in  permutation  vectors)  and  the  method  whereby  A  is  factored  have  been 
discussed.  To  solve  the  system,  we  must  put  all  of  this  information  together.  The 
key  is  the  LR  Theorem  [Ref.  24]: 

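Theorem 3.1 (LR Theorem) Let $0 \ne A \in \mathbb{R}^{m \times n}$. Then there are permutation
matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, an integer $r \ge 1$, a lower trapezoidal matrix
$L \in \mathbb{R}^{m \times r}$, and an upper (right) trapezoidal matrix $R \in \mathbb{R}^{r \times n}$ so that $Q^T A P = LR$.
The diagonal elements of L satisfy $\lambda_{i,i} = 1$ with i = 1, 2, ..., r and the diagonal
elements of R satisfy $\rho_{i,i} \ne 0$ for i = 1, 2, ..., r.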
5.  Filling in the Blanks

a.  The Main Factors

GF used the space of A to hold the two principal matrices, L and R,
in the factorization of A. To see them, we will extract the lower triangular matrix,
L, and upper (right) triangular matrix, R, from the final copy of A (3.23). Initially,
let L = R = 0. We form L by placing ones on its diagonal and filling the elements
below the diagonal from the corresponding locations in A.

\[
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix} \tag{3.34}
\]

R is formed with the diagonal elements (i.e., pivots) and upper triangle of A.

\[
R = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix} \tag{3.35}
\]
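Extracting the two factors from the packed copy of A is mechanical, as the following Python sketch suggests. The function name `unpack` is hypothetical and the sketch assumes a square packed matrix; the input is the final factored copy of A from the worked example.

```python
from fractions import Fraction as F

# Final, factored copy of A from the worked example (multipliers below the diagonal).
packed = [[F(9), F(7), F(2), F(4)],
          [F(5, 9), F(37, 9), F(26, 9), F(34, 9)],
          [F(1), F(9, 37), F(122, 37), F(114, 37)],
          [F(5, 9), F(1, 37), F(15, 61), F(-15, 183)]]


def unpack(packed):
    """Split a packed square factorization into unit lower L and upper R."""
    n = len(packed)
    L = [[packed[i][j] if j < i else F(int(i == j)) for j in range(n)]
         for i in range(n)]                       # ones on the diagonal, multipliers below
    R = [[packed[i][j] if j >= i else F(0) for j in range(n)]
         for i in range(n)]                       # pivots and upper triangle
    return L, R


L, R = unpack(packed)
for row in L:
    print([str(v) for v in row])
for row in R:
    print([str(v) for v in row])
```

Note that `Fraction` reduces -15/183 to its lowest terms, -5/61, when printing; the two are the same number.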


b.  Permutation  Matrices 


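The bookkeeping allows us to construct P and Q very quickly. To form
$P \in \mathbb{R}^{n \times n}$ we set every column, j, in P equal to the axis vector implied by $\pi_j$, the
jth element of p. This yields the permutation matrix, P, that will satisfy the LR
Theorem, namely

\[
p = \begin{bmatrix} \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix}
  = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix}
\;\Longrightarrow\;
P = \begin{bmatrix} e_4 & e_3 & e_1 & e_2 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \tag{3.36}
\]

Similarly, every column, j, in $Q \in \mathbb{R}^{m \times m}$ is set equal to the axis vector implied by
$\psi_j$, the jth element of q. For our example, we have

\[
q = \begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix}
  = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}
\;\Longrightarrow\;
Q = \begin{bmatrix} e_3 & e_2 & e_4 & e_1 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tag{3.37}
\]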
c.  Check

Now we check to make sure that our solution satisfies the LR Theorem.
First, consider the product LR:

\[
LR = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix}
     \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix} \tag{3.38}
\]

\[
   = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5 & 8 & 4 & 6 \\ 9 & 8 & 6 & 8 \\ 5 & 4 & 2 & 3 \end{bmatrix} \tag{3.39}
\]

And

\[
Q^T A P = (Q^T A) P \tag{3.40}
\]

\[
   = \begin{bmatrix} 2 & 4 & 7 & 9 \\ 4 & 6 & 8 & 5 \\ 6 & 8 & 8 & 9 \\ 2 & 3 & 4 & 5 \end{bmatrix}
     \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \tag{3.41}
\]

\[
   = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5 & 8 & 4 & 6 \\ 9 & 8 & 6 & 8 \\ 5 & 4 & 2 & 3 \end{bmatrix} \tag{3.42}
\]

Our factorization satisfies $Q^T A P = LR$.
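The same check can be run mechanically. The sketch below is a minimal illustration with hypothetical helper names (`perm_matrix`, `matmul`, `transpose`); the original matrix `A` is an assumption, inferred by undoing the recorded interchanges, while p, q, L, and R are copied from the worked example.

```python
from fractions import Fraction as F


def perm_matrix(v):
    """Build the matrix whose column j is the axis vector e_{v[j]} (v is 1-based)."""
    n = len(v)
    return [[1 if v[j] == i + 1 else 0 for j in range(n)] for i in range(n)]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]   # inferred original A
p, q = [4, 3, 1, 2], [3, 2, 4, 1]                              # permutation vectors
L = [[1, 0, 0, 0],
     [F(5, 9), 1, 0, 0],
     [1, F(9, 37), 1, 0],
     [F(5, 9), F(1, 37), F(15, 61), 1]]
R = [[9, 7, 2, 4],
     [0, F(37, 9), F(26, 9), F(34, 9)],
     [0, 0, F(122, 37), F(114, 37)],
     [0, 0, 0, F(-15, 183)]]

P, Q = perm_matrix(p), perm_matrix(q)
lhs = matmul(matmul(transpose(Q), A), P)      # Q^T A P
rhs = matmul(L, R)                            # L R
print(lhs == rhs)
```

In practice one would never form P and Q explicitly; indexing with p and q does the same job, but the explicit matrices keep the sketch close to the statement of the LR Theorem.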


d.  Solution 

Now we solve the system. Recall that Gaussian elimination operated
on the matrix, A, and the right-hand side, b, at the same time. The end result of
GE is that A is reduced to upper triangular form by successive elimination of the
lower triangle so that we could solve for u with a relatively easy back substitution.

The strategy of Gauss factorization is different. First, b is not part of
the factorization process. Secondly, even though we are changing A, we know that
we can get it back at the end (if we want to), so there is no need to save the original
A. Now, using the LR Theorem, we complete the solution. Recall that the original
system was

\[ Au = b. \tag{3.43} \]

The factorization process constructs permutation matrices P and Q and transforms
the original matrix A into a combined version of L and R. Further (by the LR
Theorem) we know that these matrices satisfy

\[ Q^T A P = LR. \tag{3.44} \]

Now, by multiplying (3.44) through by Q from the left and $P^T$ on the right, we see
that

\[ Q Q^T A P P^T = Q L R P^T. \tag{3.45} \]

Performing the cancellations on the left-hand side, we have

\[ A = Q L R P^T. \tag{3.46} \]

This is the factorization of A. Substituting this into (3.43) yields

\[ Q L R P^T u = b \tag{3.47} \]

or

\[ L R P^T u = Q^T b. \tag{3.48} \]

Now let $\hat{b} = Q^T b$ and let $\hat{u} = P^T u$. Then

\[ L R \hat{u} = \hat{b}. \tag{3.49} \]

Further, let $R\hat{u} = c$ for some unknown vector, c. We have

\[ L c = \hat{b}. \tag{3.50} \]

Since we know L and $\hat{b}$, we may solve for c by a simple forward substitution. Then,
using c and knowing that $R\hat{u} = c$, we perform a simple back substitution and determine
$\hat{u}$. Finally, by definition, $\hat{u} = P^T u$ (i.e., $\hat{u}$ is a mere permutation of u) so we
can swap elements in $\hat{u}$ to arrive at u using $P\hat{u} = u$.

Let us summarize this lengthy process into the main steps. The GF
process factors $A = Q L R P^T$, changing the general matrix into a product where the
most significant factors are both triangular. This reduces the hard problem to two
easy ones. It is designed so that we can solve for u in two steps:

•  Solve, by forward substitution, the system $Lc = \hat{b}$ for a vector, c, of unknowns.

•  Solve, by back substitution, the system $R\hat{u} = c$ for (a permutation of) the
original unknowns, u.

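So, for our example, the first step is to solve

\[
Lc = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix}
     \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
   = Q^T b
   = \begin{bmatrix} 13 \\ -5 \\ -17 \\ 0 \end{bmatrix}
   = \hat{b} \tag{3.51}
\]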


Forward substitution, applied to this system, yields

\[
c = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
  = \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix} \tag{3.52}
\]


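Now we know c, so we can solve the second triangular system, $R\hat{u} = c$, for $\hat{u}$ by back
substitution

\[
R\hat{u} = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix}
           \begin{bmatrix} \hat{u}_1 \\ \hat{u}_2 \\ \hat{u}_3 \\ \hat{u}_4 \end{bmatrix}
         = \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix} \tag{3.53}
\]

which yields

\[
\hat{u} = \begin{bmatrix} 1 \\ 2 \\ -11 \\ 3 \end{bmatrix} \tag{3.54}
\]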

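Now it is easy to recover u. Since we have defined $\hat{u} = P^T u$, we know
that $P\hat{u} = u$ (a simple rearrangement of the elements that we have already found).
We apply P to $\hat{u}$ and find that

\[
u = P\hat{u}
  = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
    \begin{bmatrix} \hat{u}_1 \\ \hat{u}_2 \\ \hat{u}_3 \\ \hat{u}_4 \end{bmatrix}
  = \begin{bmatrix} \hat{u}_3 \\ \hat{u}_4 \\ \hat{u}_2 \\ \hat{u}_1 \end{bmatrix}
  = \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix} \tag{3.55}
\]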

Comparing  this  to  earlier  solutions,  we  find  that  GF  has  arrived  at  the  same  solution. 
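The two triangular solves and the final rearrangement can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: `forward_sub` and `back_sub` are hypothetical names, the permutations are applied by indexing with p and q instead of forming P and Q, and the right-hand side `b` is inferred from the permuted vector shown in the example.

```python
from fractions import Fraction as F


def forward_sub(L, b):
    """Solve L c = b for unit lower triangular L (no division needed)."""
    c = []
    for i in range(len(b)):
        c.append(b[i] - sum(L[i][j] * c[j] for j in range(i)))
    return c


def back_sub(R, c):
    """Solve R u = c for upper triangular R with nonzero diagonal."""
    n = len(c)
    u = [F(0)] * n
    for i in reversed(range(n)):
        u[i] = (c[i] - sum(R[i][j] * u[j] for j in range(i + 1, n))) / R[i][i]
    return u


L = [[1, 0, 0, 0],
     [F(5, 9), 1, 0, 0],
     [1, F(9, 37), 1, 0],
     [F(5, 9), F(1, 37), F(15, 61), 1]]
R = [[9, 7, 2, 4],
     [0, F(37, 9), F(26, 9), F(34, 9)],
     [0, 0, F(122, 37), F(114, 37)],
     [0, 0, 0, F(-15, 183)]]
p, q = [4, 3, 1, 2], [3, 2, 4, 1]
b = [0, -5, 13, -17]                    # assumption: inferred from Q^T b in the example

b_hat = [b[qi - 1] for qi in q]         # b_hat = Q^T b
c = forward_sub(L, b_hat)               # L c = b_hat
u_hat = back_sub(R, c)                  # R u_hat = c
u = [None] * len(p)
for j, pj in enumerate(p):
    u[pj - 1] = u_hat[j]                # u = P u_hat
print([str(v) for v in u])
```

Once the factorization is available, each additional right-hand side costs only these two substitutions and two permutations.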

In  these  examples,  the  notion  of  elimination  was  developed  first.  The 
GE  process  performs  successive  eliminations  beneath  its  pivots  and  reduces  A  to 
triangular form, and then the solution is available in only $n^2$ flops. GF spends
an  almost  identical  amount  of  work  in  the  reduction  process,  but  the  result  is  a 
factorization  with  L  and  R  being  the  significant  factors.  (They  are  the  only  ones 
that  are  more  than  a  permutation  of  the  identity).  In  the  examples,  we  used  pivoting 
because  it  was  practical.  Now  let  us  take  a  closer  look  at  the  justifications  for 
pivoting. 
F.  PIVOTING FOR SIZE


The issue of pivoting is a very interesting and important one. We concluded that
we must pivot or face the possibility of attempting to divide by zero, an unacceptable
option. To solve this problem, we may pick any nonzero element in A(k:m, k:n)
and perform the column and row interchanges required to install it as the new pivot
(k is the pivot index). There are many strategies that we could adopt.

The  logical  question  would  be  something  like:  “Given  that  we  must  pivot,  what 
is  the  best  means  available?”  But  the  answer  is  not  so  easy,  and  there  are  many 
trade-offs  to  be  considered.  We  are  faced  with  choosing  along  a  spectrum,  where 
speed  lies  at  one  end  and  accuracy  lies  at  the  other.  For  instance,  we  could  begin  a 
search and pick the first nonzero element in this area. Or, we could search for the
row with the most nonzero elements (that had a nonzero element in the kth column).

The two most common strategies for pivoting are the partial and complete methods,
which we have discussed. We determined that partial pivoting would work perfectly
(with no error) if A was nonsingular and the storage and arithmetic could be
handled with infinite precision. If infinite precision were available, we could stop
right here. There would be no need to try to refine the method. In a finite-precision
machine, however, we must deal with the issue of errors.

To deal with errors, the problem must be stated more precisely. The errors
that concern us would arise due to growth of the elements of L and/or R as we step
through the stages of Gauss. In the end, partial pivoting guarantees that all of the
elements of L will be, at most, unity in absolute value. This is easy to see. The
pivoting strategy chooses each pivot to be the largest element (in absolute value) in
column k at or below row k. This value is installed at A(k,k) and everything below
the pivot is divided by the pivot.
Unfortunately, partial pivoting cannot make the same guarantee for the elements
of R. It helps; the multipliers are less than or equal to one in absolute value.
The elements of R are bounded by $2^{n-1} a$, where a is the largest absolute value of
the elements in A. This bound is not normally attained “in practice”. [Ref. 23]

Growth is an indicator of trouble in this process. If we cannot control it completely,
we should, at a minimum, monitor it. The growth factor, g(n), of a Gauss
factorization process for $A \in \mathbb{R}^{n \times n}$ is defined as follows. Let a be the largest absolute
value in the original matrix, A. Let b be the largest absolute value that occurs in
any Gauss transform, G, including the first one, G = A. Then g(n) = b/a gives a
growth factor normalized by a (i.e., $g(n) \ge 1$).
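A minimal sketch of how the growth factor might be monitored is given below. The function name `growth_factor_partial` is hypothetical, the sketch assumes a square nonsingular matrix, and it uses partial pivoting only because that keeps the code short. The test matrix is the well-known worst case for partial pivoting (ones on the diagonal and in the last column, minus ones below the diagonal), for which the growth doubles at every stage.

```python
from fractions import Fraction


def growth_factor_partial(A):
    """Return g = b/a for Gauss factorization with partial pivoting."""
    A = [[Fraction(v) for v in row] for row in A]        # exact working copy
    n = len(A)
    a = max(abs(v) for row in A for v in row)            # largest |entry| of the original A
    b = a                                                # the first transform is G = A
    for k in range(n - 1):
        s = max(range(k, n), key=lambda i: abs(A[i][k])) # first largest entry in column k
        A[k], A[s] = A[s], A[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                           # multiplier
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]             # update G
                b = max(b, abs(A[i][j]))                 # track the largest entry seen in any G
    return b / a


W = [[1, 0, 0, 1],
     [-1, 1, 0, 1],
     [-1, -1, 1, 1],
     [-1, -1, -1, 1]]
print(growth_factor_partial(W))
```

For this 4 x 4 matrix the last column grows as 1, 2, 4, 8, so the growth factor equals 2 raised to the power n - 1.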

A great deal of analysis has been done on this subject. Wilkinson showed
that, with complete pivoting and real matrices, g(n) grows much more slowly than
$2^n$. He conjectured that $g(n) \le n$. The latter has recently been disproved, with a
counterexample by Nicholas Young. [Ref. 23]

As a practical matter, when one seeks to monitor growth one uses complete
pivoting. To consider performance, one uses the partial pivoting strategy. The
growth factor, g(n), is easy to monitor with a complete pivoting strategy since we are
moving through the entire Gauss transform area at each stage anyway. For clarity,
the pivoting algorithms and the Update algorithm are listed separately in this
chapter. In real code (e.g., Appendix F), however, the pivot for stage (k + 1) should
be located during the update of G in stage k (to avoid unnecessary passes through
the matrix). This would mean extra work in the partial pivoting algorithm. Since
the primary reason for using partial pivoting is performance, it is counterproductive
to monitor g(n) while using partial pivoting. A description of both pivoting policies,
in algorithm form, follows.
Algorithm 3.1 (Partial Column Pivoting for Size) Given the matrix of coefficients,
$A \in \mathbb{R}^{m \times n}$; a permutation vector, $q \in \mathbb{R}^m$; and an index, k, indicating the
pivot column, this algorithm performs partial pivoting. First, the pivot element is
located at A(s,k) with $s \ge k$. Once the pivot has been located, rows s and k are
swapped to install the new pivot. Additionally, elements in q, indexed by s and k,
are swapped to record the row interchanges.

begin PP
    s = k;
    for i = (k + 1) : m
        if (|A(i,k)| > |A(s,k)|)
            s = i;
        end if
    end for
    if (s ≠ k)
        for j = 1 : n
            x = A(k,j);
            A(k,j) = A(s,j);
            A(s,j) = x;
        end for
        i = q(k);
        q(k) = q(s);
        q(s) = i;
    end if
end PP
Algorithm 3.2 (Complete Pivoting for Size) Given the matrix of coefficients,
$A \in \mathbb{R}^{m \times n}$; permutation vectors, $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^m$; and an index, k, indicating the
pivot row and column, this algorithm performs complete pivoting. First, the pivot
element is located at A(s,t). Once the pivot has been located, rows s and k and
columns t and k are swapped to install the new pivot. The permutation vectors are
updated accordingly.

begin PC
    s = k;
    t = k;
    for i = k : m                                  (locate the pivot)
        for j = k : n
            if (|A(i,j)| > |A(s,t)|)
                s = i;
                t = j;
            end if
        end for
    end for
    if (s ≠ k)                                     (row interchanges)
        for j = 1 : n
            x = A(k,j); A(k,j) = A(s,j); A(s,j) = x;
        end for
        i = q(k); q(k) = q(s); q(s) = i;
    end if
    if (t ≠ k)                                     (column interchanges)
        for i = 1 : m
            x = A(i,k); A(i,k) = A(i,t); A(i,t) = x;
        end for
        i = p(k); p(k) = p(t); p(t) = i;
    end if
end PC
G.  SEQUENTIAL ALGORITHMS


The examples considered have described the Gauss process. We first considered
elimination (GE) and then a factorization method (GF). Both methods require work
of the same order, so the latter, which yields a factorization of A, is much preferred.
Algorithms for the GF process are described below. The arithmetic in the Gauss
transform area, G, is performed the same way (regardless of pivoting strategy), so a
separate algorithm is given for updating G. The algorithms GFPP (pivoting, partial)
and GFPC (pivoting, complete) are given following the updating algorithm. These
algorithms are adapted from Gragg [Ref. 23].


Algorithm 3.3 (Update Gauss Transform Area) Given the matrix of coefficients,
$A \in \mathbb{R}^{m \times n}$, and k, the pivot column, this algorithm performs the appropriate
arithmetic throughout the pivot column and Gauss transform area, G, of A.

begin Update
    x = A(k,k);                                    (x is the pivot value)
    for i = (k + 1) : m                            (pivot column division)
        A(i,k) = A(i,k)/x;
    end for
    for i = (k + 1) : m                            (arithmetic in G)
        x = A(i,k);                                (now x is the multiplier)
        for j = (k + 1) : n
            A(i,j) = A(i,j) - x · A(k,j);
        end for
    end for
end Update
Algorithm 3.4 (Gauss Factorization with Partial Pivoting) Given the matrix
of coefficients, $A \in \mathbb{R}^{n \times n}$, this algorithm modifies (overwrites) A with a unit lower
triangular matrix (with an implicit diagonal), $L \in \mathbb{R}^{n \times n}$, and an upper (right) triangular
matrix, $R \in \mathbb{R}^{n \times n}$, having nonzero diagonal elements (the pivots). The process
also forms the row permutation vector, q, and the corresponding permutation matrix,
$Q \in \mathbb{R}^{n \times n}$, that results from partial column pivoting for size. The algorithm gives
the factorization: $Q^T A = LR$.

begin GFPP
    n = order(A)
    Q = zeros(n, n)
    for j = 1 : n
        q(j) = j;                                  (initialize q)
    end for
    for k = 1 : n
        PP(A, q, k)
        if (A(k,k) = 0)
            print “A is singular!”
            exit
        end if
        Update(A, k)
    end for
    for j = 1 : n
        Q(q(j), j) = 1.0;
    end for
end GFPP
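A compact Python sketch of GFPP follows. It is an illustration only: the names `gfpp` and `lr_product` are hypothetical, the pivot search and the update are inlined instead of being separate routines, the permutation is returned as the vector q without forming Q, and the test matrix is the example matrix as inferred from the worked example.

```python
from fractions import Fraction


def gfpp(A):
    """Gauss factorization with partial pivoting; returns (packed LR, q), q 1-based."""
    A = [[Fraction(v) for v in row] for row in A]            # exact working copy
    n = len(A)
    q = list(range(1, n + 1))                                # row permutation vector
    for k in range(n):
        s = max(range(k, n), key=lambda i: abs(A[i][k]))     # first largest entry at or below row k
        if A[s][k] == 0:
            raise ValueError("A is singular")
        A[k], A[s] = A[s], A[k]                              # swap whole rows, multipliers included
        q[k], q[s] = q[s], q[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                               # pivot column division
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]                 # arithmetic in G
    return A, q


def lr_product(packed):
    """Multiply the unit lower and upper factors stored in a packed square matrix."""
    n = len(packed)
    L = [[packed[i][j] if j < i else Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    R = [[packed[i][j] if j >= i else Fraction(0) for j in range(n)] for i in range(n)]
    return [[sum(L[i][k] * R[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


A0 = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
packed, q = gfpp(A0)
permuted = [A0[qi - 1] for qi in q]                          # rows of Q^T A
print(q, lr_product(packed) == permuted)
```

Row i of the permuted matrix is row q(i) of the original, which is what multiplying by the transpose of Q does.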
Algorithm 3.5 (Gauss Factorization with Complete Pivoting) Given a matrix
of coefficients, $A \in \mathbb{R}^{m \times n}$, the following algorithm modifies (overwrites) A with
a unit lower trapezoidal matrix (with implicit diagonal), $L \in \mathbb{R}^{m \times r}$, and an upper
(right) trapezoidal matrix, $R \in \mathbb{R}^{r \times n}$. The diagonal elements of R are nonzero (pivots).
The process forms permutation matrices, $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, to reflect
the complete pivoting for size. These matrices are formed to satisfy the LR Theorem:
$Q^T A P = LR$.

begin GFPC
    m = rows(A); n = cols(A);                      (initialization)
    P = zeros(n,n); Q = zeros(m,m);
    for j = 1 : n
        p(j) = j;
    end for
    for i = 1 : m
        q(i) = i;
    end for
    for k = 1 : min(m,n)                           (the Gauss process)
        PC(A, p, q, k)                             (pivoting)
        if (A(k,k) = 0)
            print “A is singular!”
            exit
        end if
        Update(A, k)                               (Update G)
    end for
    for j = 1 : n                                  (form P)
        P(p(j), j) = 1.0;
    end for
    for j = 1 : m                                  (form Q)
        Q(q(j), j) = 1.0;
    end for
end GFPC
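For comparison, a Python sketch of GFPC is given below. It is a sketch under assumptions, not the thesis implementation: `gfpc` is a hypothetical name, the pivot search scans column by column (as the worked example does), the permutations are returned as 1-based vectors p and q without forming P and Q, and the input is the example matrix as inferred from the worked example.

```python
from fractions import Fraction


def gfpc(A):
    """Gauss factorization with complete pivoting; returns (packed LR, p, q)."""
    A = [[Fraction(v) for v in row] for row in A]        # exact working copy
    m, n = len(A), len(A[0])
    p = list(range(1, n + 1))                            # column permutation vector
    q = list(range(1, m + 1))                            # row permutation vector
    for k in range(min(m, n)):
        s, t = k, k
        for j in range(k, n):                            # locate the pivot in G
            for i in range(k, m):
                if abs(A[i][j]) > abs(A[s][t]):
                    s, t = i, j
        if A[s][t] == 0:
            raise ValueError("A is singular")
        A[k], A[s] = A[s], A[k]                          # row interchange
        q[k], q[s] = q[s], q[k]
        for row in A:                                    # column interchange
            row[k], row[t] = row[t], row[k]
        p[k], p[t] = p[t], p[k]
        for i in range(k + 1, m):
            A[i][k] /= A[k][k]                           # pivot column division
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]             # arithmetic in G
    return A, p, q


packed, p, q = gfpc([[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]])
for row in packed:
    print([str(v) for v in row])
print(p, q)
```

Run on the example matrix, the sketch should reproduce the final factored copy of A and the permutation vectors of the worked example, with -15/183 printed in lowest terms as -5/61.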
H.  CONJUGATE GRADIENTS


Time permits only a brief synopsis of the method of conjugate gradients (CG).
This method was described by Magnus R. Hestenes and Eduard Stiefel [Ref. 18].
CG possesses some very nice characteristics and it is quite different from the Gauss
method. Once again, we begin with a system of linear equations


\[ Au = b \tag{3.56} \]

The algorithm given by Hestenes and Stiefel is designed for $A \in \mathbb{R}^{n \times n}$ symmetric
and positive definite (Appendix A). Let $s \in \mathbb{R}^n$ be the vector that would solve (3.56)
exactly, so that As = b. Let $u_i \in \mathbb{R}^n$ be the estimate of the solution, s, produced
in the ith iteration. The original estimate, $u_0$, is merely a guess (it may be a good
guess). For instance, in the absence of better information, we could choose $u_0$ to be
the vector of all zeros or all ones.

The CG process takes our initial guess and develops a (guaranteed) better
estimate for the next stage. To measure the progress, we could use the residual
vector

\[ r_i = b - A u_i \tag{3.57} \]

but Hestenes and Stiefel warn that its Euclidean norm, $\| r_i \|_2$, may actually increase
in every step but the last! A more reliable measure, called the error vector

\[ e_i = s - u_i \tag{3.58} \]

has monotonically decreasing length. After n iterations of the CG process, we are
guaranteed to have a very good estimate $u_n$ of s. In fact, if no rounding errors
occur, we have $u_n = s$. In practice, CG can find a very good estimate, $u_m$, of s
in m iterations, with $m \ll n$. The process “terminates in at most n steps if no
rounding-off errors are encountered.” [Ref. 18: p. 410]
The algorithm below is adapted from Hestenes and Stiefel [Ref. 18]. Before
considering the algorithm, however, we should define the key term, conjugate. For
A symmetric, two vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$ are said to be A-orthogonal (or
conjugate) if the relation $x^T A y = (Ax)^T y = 0$ holds [Ref. 18: p. 410]. This is
an extension of vector orthogonality, $x^T y = 0$. The algorithm given below is very
simple. The iteration blindly proceeds from i = 0 to i = n. A more sophisticated
(finite precision) scheme would set a tolerance (notion of “good enough”) and stop
(exit the loop) when this criterion was satisfied.


Algorithm 3.6 (The Method of Conjugate Gradients) Given the symmetric,
positive definite matrix of coefficients, $A \in \mathbb{R}^{n \times n}$, and an initial guess, $u_0$, for the
solution, s, of the system Au = b, this algorithm (in the absence of rounding-off
errors) finds $u_i = s$ in i iterations ($i \le n$). The algorithm keeps track of a residual
vector, $r_i$, and direction vectors, $p_i$. The residuals, $r_i$, are mutually orthogonal and
the direction vectors, $p_i$, are mutually conjugate (A-orthogonal).

begin CG
    $u_0$ = zeros(n)                                 (arbitrary initial guess)
    $p_0 = r_0 = b - A u_0$
    for i = 0 : n
        $\delta = p_i^T A p_i$                       (denominator used below)
        $\alpha_i = (p_i^T r_i)/\delta$              (scalar multiplier used below)
        $u_{i+1} = u_i + \alpha_i p_i$               (estimate of solution)
        $r_{i+1} = r_i - \alpha_i A p_i$             (residual vector)
        $\beta_i = -(r_{i+1}^T A p_i)/\delta$
        $p_{i+1} = r_{i+1} + \beta_i p_i$            (direction vector)
    end for
end CG
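The iteration can be sketched in Python as follows. This is a minimal sketch, not the thesis code: the names `conjugate_gradients`, `matvec`, `dot`, and the parameter `tol` are hypothetical, the loop runs at most n times and exits early once the residual is small enough (the kind of stopping test the text mentions), and the 2 x 2 test system is an invented example chosen only because its exact solution is easy to check by hand.

```python
from fractions import Fraction


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradients(A, b, tol=0):
    """Conjugate gradients for symmetric positive definite A, starting from u = 0."""
    n = len(b)
    u = [0] * n                                        # arbitrary initial guess
    r = [bi - ai for bi, ai in zip(b, matvec(A, u))]   # residual r = b - A u
    p = list(r)                                        # first direction
    for _ in range(n):
        if dot(r, r) <= tol * tol:                     # good enough: stop early
            break
        Ap = matvec(A, p)
        delta = dot(p, Ap)                             # denominator used below
        alpha = dot(p, r) / delta                      # step length
        u = [ui + alpha * pi for ui, pi in zip(u, p)]  # new estimate of the solution
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = -dot(r, Ap) / delta                     # keeps the new direction A-orthogonal
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return u


A = [[Fraction(4), Fraction(1)], [Fraction(1), Fraction(3)]]   # small SPD test matrix (assumed)
b = [Fraction(1), Fraction(2)]
u = conjugate_gradients(A, b)
print([str(v) for v in u])
```

With exact rational arithmetic there are no rounding errors, so the sketch reaches the exact solution in n = 2 iterations; with floating-point inputs one would pass a small positive `tol`.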
I.  SUMMARY


This chapter develops the Gaussian elimination process, the Gauss factorization
process, pivoting strategies, and (briefly) the method of conjugate gradients.
Each of the corresponding algorithms possesses potential for parallel solution. A
parallel implementation of GF appears in the following chapter. Both partial and
complete pivoting are pursued, with further discussion on their implications in a
parallel environment.
IV.  PARALLEL DESIGN


Nature  is  pleased  with  simplicity,  and  affects  not  the  pomp  of  superfluous 
causes. 

—  SIR  ISAAC  NEWTON  (1642-1727) 

Sequential  algorithms  for  Gauss  factorization  (GF)  and  the  method  of  conjugate 
gradients  (CG)  are  established  in  Chapter  III.  The  goal  of  this  chapter  is  to  show 
parallel  algorithms  for  Gauss  factorization.  The  C  programs  that  implement  these 
algorithms  are  discussed  in  Chapter  V  and  listed  in  Appendix  F. 

Parallel  algorithm  design  is  a  process  that  includes  many  considerations.  The 
question  of  how  to  achieve  parallelism  is  largely  an  art  and  is  not  discussed  here. 
The  method  used  in  this  research  is  often  called  a  workfarm  approach  because  the 
algorithm  farms  out  work  to  processors.  Equivalently,  it  may  be  called  a  manager- 
worker  model.  When  we  distribute  the  problem  across  many  processors  in  a  workfarm 
style,  there  are  quite  a  number  of  issues  that  warrant  careful  consideration.  The 
concerns  associated  with  programming  a  parallel  machine — even  with  a  relatively 
simple  model  such  as  this — could  occupy  volumes. 

Communications, load balancing, granularity, and other considerations abound.
Metrics like speedup and efficiency should be used to lend credibility to the parallel
nature of the algorithm. Additionally, we should consider the usual issues of
maintainability, readability, portability, and other traits commonly associated with good
(sequential) programming practice. Parallel codes must be clear combinations of
sequential codes that are joined together in a logical manner. Simplicity should hold
a place of great esteem in a parallel algorithm. The rest of this chapter introduces
the issues of parallel design, particularly as they pertain to Gauss factorization.


A.  INTERPROCESSOR  COMMUNICATIONS 


Interprocessor communication is one of the most fundamental issues in parallel
processing and, quite possibly, the most involved. Without a means of communicating
(in a message-passing environment), the multiprocessor system is meaningless.
The implications of any communications scheme are many and the interactions can
be quite complex. Exhaustive coverage of this issue is out of the question, so we will
consider a few of the most essential ideas.

1.  The  Network 

A network is the part of a multiprocessor system’s hardware that bears
the interprocessor communications burden. It is a combination of nodes and links
that connect those nodes, and it is the foundation upon which all communications
must build. We will also refer to the nodes of a multiprocessor, using somewhat
loose terminology, as processors. The term node is more general: nodes
are typically more sophisticated than a simple central processing unit (CPU) or, for
that matter, any other sort of processor. A link is a wire that connects two nodes.
An interconnection topology describes the pattern of links used to connect the nodes
of a network. The network can be drawn so that we can see how its
nodes are connected. Appendix C discusses interconnection topologies and gives
a description (and illustrations) of the particular scheme used in this research: the
hypercube.

Intel combines an 80386 CPU with an 80387 math coprocessor and communications
facilities to form a “CX” node for the iPSC/2 that was used in this research.
INMOS provides the same general capabilities but packages them all on a single (very
sophisticated) chip, called a transputer. Figure 4.1, from INMOS’ T9000 Transputer
Products Overview Manual [Ref. 25: p. 31], shows a high-level block diagram of the
components of a T9000 transputer. Thus, any node of a message-passing multiprocessor
system can be thought of as a combination of computing and communications
facilities. It may possess other capabilities as well.

2.  Message  Routing 

The machines used in this research exhibit different message transmission
schemes. The transputer system employs high-speed (20 megabits per second)
point-to-point serial communications and store-and-forward message passing. That is, for
multi-hop communications, each node along the way must receive the message, store
it in local memory temporarily, and then pass it to the next node in the route.

The Intel iPSC/2 uses another technique, called circuit switching or
direct-connect communications. This approach is much like our telephone system. First,
the originator of the message sends a small message containing information about
the message (e.g., destination node number, length of message) to the destination
via the nodes in between. As this small header packet makes its way to the destination,
the nodes along the way flip switches, closing a circuit from the sender to the
receiver. Once this circuit is established, the message proceeds from the sender to
the destination without interruption.

Each method has its advantages and disadvantages. The circuit switching
approach allows for fewer interruptions along the way, but it ties up the entire path
for the duration of the communication. The store-and-forward method imposes
delays for storing the message into, and then retrieving it from, the memory of every
node along the way. (A more complete description of these two techniques, together
with experimental results, is given in Appendix B.) For the algorithms employed in
this research, almost all communications were “nearest neighbor” in the hypercube.
In this case, the difference between the two approaches to message routing is
insignificant and the nearest neighbor performance becomes more important.

3.  Concurrent  Computing  and  Communicating

The nodes of a multiprocessor machine should be able to both compute
and communicate efficiently and concurrently. This is no small undertaking. The
computing side must access memory to accomplish its mission, but message passing
also begins by drawing data out of memory and ends by storing data into memory.
Therefore, at a minimum, we have competition related to memory accesses.
Furthermore, the computing and communication must be synchronized to some extent.
The algorithms used in this research employed blocking communications (described
in Appendix E), which enforce synchronization.

There are overheads associated with communications and this synchronization
problem. Bryant showed how transputers perform under various communication
loads [Ref. 26] and this is mentioned in Appendix E. The issue of overheads
is one that Charles Seitz considered for the “Cosmic Cube.” Much, but not all, of
the overhead is communication-related. Seitz listed three of the major problems
[Ref. 27: p. 28]:

(1) the idle time that results from imperfect load balancing, (2) the waiting
time caused by communications latencies in the channels and in the message
forwarding, and (3) the processor time dedicated to processing and forwarding messages,
a consideration that can be effectively eliminated by architectural improvements
in the nodes.

Included  in  these  costs,  we  should  also  recognize  that  some  amount  of  time  is  required 
for  the  processor  to  perform  “context  switching”  (changing  jobs)  and/or  coordination 
with  a  special-purpose  processor  that  we  might  call  the  communications  manager. 

Although the issue of concurrent communication and computing is a very
complex one, we may consider significant issues that are related to the efficiency of
communications and the effect upon the processor. Geoffrey Fox presents the notion
of comparing communications ability to processing ability [Ref. 28: pp. 50-51]. Let
$t_{calc}$ be “the typical time required to perform a generic calculation. For scientific
problems, this can be taken as a floating-point calculation $a = b \times c$ or $a = b + c$.”
Furthermore, let $t_{comm}$ be “the typical time taken to communicate a single word
between two nodes connected in the hardware topology.” Then the ratio

\[ \frac{t_{comm}}{t_{calc}} \]

is a general characteristic of a particular system that can be quite useful in comparing
machines. Fox uses this ratio in much of the rest of his work.

A parallel machine must necessarily possess a capable communications subsystem,
but this is not enough. The program should also make prudent use of the
communications facilities. This means that the programmer and/or compiler must
exhibit a good understanding of the machine’s communications abilities and weaknesses.
Some characteristics are nearly universal. Most machines, for instance,
reward the use of long messages because there is an overhead (nearly independent
of message length in many cases) to sending any message. Other characteristics are
very much machine-dependent, so the programmer should be relatively familiar
with the communications abilities and characteristics of the target
machine.
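The reward for long messages can be made concrete with a first-order cost model. The sketch below assumes (this is a common simplification, not a model taken from the thesis) that every message costs a fixed startup time plus a per-word time; the function names and the sample figures in the usage note are hypothetical.

```c
/*
 * Assumed linear cost model: one message of `words` words costs
 * t_startup + words * t_word, in whatever time unit the caller uses.
 */
double message_time(double t_startup, double t_word, unsigned long words)
{
    return t_startup + t_word * (double)words;
}

/*
 * Time to move `words` words when they are split across `messages`
 * separate messages: the startup cost is paid once per message.
 */
double batched_time(double t_startup, double t_word,
                    unsigned long words, unsigned long messages)
{
    return (double)messages * t_startup + t_word * (double)words;
}
```

With an illustrative startup cost of 100 units and 1 unit per word, 1000 words cost 1100 units as one message but 101000 units as 1000 one-word messages. The real startup and per-word figures are machine-dependent and would have to be measured.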

4.  Accessing  the  Clock 

The ability to accurately measure the time required by communications
and computations, preferably at the host and every node in the system, is absolutely
essential in a multiprocessor environment. Profiling, in a sequential program, allows
us to compare the time required by various parts of a program. Timing in a parallel
environment allows us to profile the code in the same way. Thus we can determine the
time required for instructions, loops, functions, or communications.

Profiling is an even more important practice for parallel coding than it is in
the sequential case. The only way for a parallel program to be useful is if it
can be implemented efficiently upon an acceptable number of processors. That is,
in general, the only object in choosing a multiprocessor system over a sequential
machine is the speed with which computation can be performed. One of the best
tools available to the parallel programmer is the ability to see where and how much
time is being spent.

At a minimum, we need the ability to sample a clock with reasonable precision.
Both machines and compilers used in this research provide this capability (see
timing.h in Appendix F for details). The transputers offer a choice of frequencies:
the clock associated with low priority processes has a period of 64 microseconds and
the high priority clock offers one microsecond ticks. The iPSC/2 mclock() function
gives time in milliseconds.
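Using the clock as a stopwatch takes only a few lines. The sketch below does not reproduce timing.h or the vendor clock calls; it substitutes the standard C `clock()` function, which reports processor time rather than elapsed time, and the `stopwatch` type and `busy_sum` workload are hypothetical names used only for illustration.

```c
#include <time.h>

/* Accumulates the time spent between matching start/stop calls. */
typedef struct {
    clock_t started;        /* clock sample taken at the last start */
    double  total_seconds;  /* accumulated time over all intervals  */
} stopwatch;

void stopwatch_start(stopwatch *sw)
{
    sw->started = clock();
}

void stopwatch_stop(stopwatch *sw)
{
    sw->total_seconds += (double)(clock() - sw->started) / CLOCKS_PER_SEC;
}

/* Stand-in for a segment of code to be profiled: sums 0 .. n-1. */
double busy_sum(long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += (double)i;
    return sum;
}
```

A segment is profiled by bracketing it with `stopwatch_start` and `stopwatch_stop`; keeping one stopwatch per category (for example, one for communication and one for computation) lets the totals be compared at the end of a run. The resolution of `clock()` varies by platform, so short segments should be timed over many repetitions.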

B.  METRICS  FOR  PARALLEL  COMPUTING 
1.  Complexity 

Perhaps  the  most  obvious  measures  for  a  parallel  algorithm  are  simply 
those  that  we  use  for  sequential  algorithms.  We  want  to  keep  time  and  storage 
requirements  to  a  minimum.  Perhaps  the  major  difference  in  complexity  analysis 
for  a  parallel  algorithm  is  that  we  are  primarily  interested  in  a  per-processor  notion 
of  complexity.  If  the  problem  has  been  farmed  out  in  a  fair  manner,  complexity 
analysis  for  the  parallel  case  is  merely  an  extension  of  the  sequential  case. 

Consider the matrix $A \in \Re^{n \times n}$. Suppose that its elements are 8-byte,
double-precision, floating-point values (type double in C). Let $M_p$ denote the
memory (in bytes) required at each processor to store $A$ on $p$ processors and let $T_p$
denote the time required for $p$ processors to solve the system characterized by $A$.
Then $M_1 = 8n^2$ bytes of storage, but (ideally) $M_8 = n^2$. When the problem is
distributed across $p$ processors simultaneously, the processors can share the storage
burden.


Exceptions abound. For certain problems, it may actually be convenient
(faster or more reliable) to store the entire matrix at each processor. Nevertheless,
in most cases we would like to minimize local memory requirements. The Gauss
factorization algorithm considered near the end of this chapter is no exception.
Indeed, the transputers used in this work had only 32 kilobytes of storage each and
the results of Chapter VI for transputers show how this can dictate the size of the
problem that can be executed. The concepts of time and storage complexity have
been developed in detail for sequential algorithms and they seem to hold a place in
parallel algorithm assessment as well. We consider other measures that have been
developed for parallel computing in the following section.
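The storage argument can be turned into a quick feasibility check. The sketch below assumes an ideal, perfectly even split of an $n \times n$ matrix of 8-byte values and ignores everything else that competes for node memory (code, stack, message buffers), so it gives an optimistic upper bound only; the function names are hypothetical.

```c
#include <stddef.h>

/* Bytes per node for an n-by-n matrix of 8-byte values split evenly
 * over p nodes (rounded up). Idealized: no other storage is counted. */
size_t ideal_bytes_per_node(size_t n, size_t p)
{
    return (8u * n * n + p - 1) / p;
}

/* Largest order n whose ideal share fits in budget_bytes on each of
 * p nodes. Found by simple upward search for clarity. */
size_t max_order_for_budget(size_t budget_bytes, size_t p)
{
    size_t n = 0;
    while (ideal_bytes_per_node(n + 1, p) <= budget_bytes)
        n++;
    return n;
}
```

For example, with 8 nodes and a budget of 32768 bytes per node, the bound is $n = 181$, since $181^2 = 32761$. A real program would fit a noticeably smaller problem because the matrix does not have the node's memory to itself.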

2.  Contemporary  Measures 

The concepts of speedup and efficiency (Appendix A) are two of the most
common performance measures currently associated with parallel computing, with
the ideal case (100% efficiency) yielding $t_P = t_1/P$ on a $P$-processor system. Selim
Akl proposes the following criteria for analyzing algorithms [Ref. 29: pp. 21-28]:

•  Running Time: Running time $t(n)$ is the time required to execute an
algorithm for a problem of input size $n$. Akl lists three ways to express this
notion. First, we may count the steps in an algorithm. Akl distinguishes
between computational steps (i.e., something like flops) and routing steps that
are associated with interprocessor communication. Second, we have lower and
upper bounds (e.g., the complexity notation presented in Appendix A).
Finally, we have speedup. Akl gives the usual definition of speedup but clarifies
it somewhat (details below).

•  Number of Processors: Second in importance, Akl considers the number of
processors required by an algorithm. He uses $p(n)$ to denote the number of
processors required for a problem of size $n$.

•  Cost: Akl defines the cost, $c(n)$, for a parallel algorithm as the product of the
first two factors. That is, $c(n) = t(n) \times p(n)$.

•  Other Measures: In this category, we have no fewer than three other qualities
of a parallel system that deserve consideration. The area (i.e., chip real estate)
required by the processors is significant. The length of the links, as well as
any patterns they form (regularity and modularity), figures in. And finally, the
period between processing different elements of an input is important.

Apparently, metrics for parallel computing are still developing. There are several
very useful concepts such as speedup and efficiency. The definition of speedup, at
first glance, is rather standard. It doesn’t take much probing, however, to find that
different authors make different assumptions. Akl defines speedup $S$ in the usual
manner,

\[ S = \frac{t_1}{t_p} \qquad (4.1) \]

except that he is somewhat more specific about the times. He defines $t_1$ as the
“worst-case running time of fastest known sequential algorithm for problem” and $t_p$
as “worst-case running time of parallel algorithm” [Ref. 29: p. 24]. He has been
more specific than most authors, but it seems likely that the algorithms, the method of
obtaining times $t_1$ and $t_p$, and the systems should also be specified. Speedup is defined
loosely in most cases. A parameterization to accompany speedup would be tedious,
but useful. Until speedup becomes a standard term with accepted meaning, we shall
have to specify exactly what it means. We should be more careful with this term.
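For reference, the measures discussed in this section reduce to three one-line formulas. The sketch below assumes the customary definition of efficiency as speedup divided by the number of processors (consistent with the ideal case $t_P = t_1/P$ quoted above, though Appendix A is not reproduced here); the function names and the timings in the usage note are hypothetical.

```c
/* Speedup: sequential time divided by parallel time. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency, assumed here to be speedup per processor (1.0 is ideal). */
double efficiency(double t1, double tp, unsigned processors)
{
    return speedup(t1, tp) / (double)processors;
}

/* Akl's cost: running time multiplied by the number of processors. */
double cost(double running_time, unsigned processors)
{
    return running_time * (double)processors;
}
```

As an illustration, a made-up sequential time of 120 seconds and a parallel time of 20 seconds on 8 processors give a speedup of 6, an efficiency of 0.75, and a cost of 160 processor-seconds. None of these numbers means much unless the algorithms, timing method, and systems behind $t_1$ and $t_p$ are stated alongside them.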
3.  Other  Ideas


Akl has appropriately distinguished between computational steps and routing
steps. The term floating-point operations (flops) has become quite popular (along
with benchmarks) and this is a useful means of expressing the computational ability
of a machine (for floating-point applications). The notion of routing, however, is
somewhat vague. Nevertheless, this idea must be addressed. It should probably
become more specific as we talk about similar machines.

The machines used for this research were MIMD message-passing systems.
We can get much more specific about “routing steps” for such a machine. First, using
the clock as a stopwatch, we can profile any segment of code (including calculations
and/or communications). An implementation-specific version of Fox’s $t_{comm}/t_{calc}$
ratio can be instructive. It is important to apply this ratio to the hardware as Fox
defines it, but it is equally important to recognize the role of the software (algorithm).
That is, for some specific implementation, we should be interested in finding some
measure of how much time is spent communicating and how much time is spent
computing. More specifically, a careful profile could be made of a program in the
following manner.

The ratio of cumulative (i.e., over the execution of the entire program) time
spent communicating to time spent computing should be considered as a first cut,
especially if performance (efficiency) is weak. Algorithms such as Gauss factorization
are executed in stages, within a loop of some sort. In this case, the $t_{comm}/t_{calc}$
ratio per iteration is an interesting figure (and, if the loop represents most of the
program’s execution time, this should be approximately equal to the cumulative
figure).
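Given per-stage timings, both figures are simple to compute. The sketch below assumes the communication and computation times of each stage have already been measured and stored; the `stage_time` record and the function names are hypothetical and are not taken from the thesis code.

```c
#include <stddef.h>

/* Measured times for one stage (iteration) of the main loop. */
typedef struct {
    double comm;  /* time spent communicating during the stage */
    double calc;  /* time spent computing during the stage     */
} stage_time;

/* Communication-to-computation ratio for a single stage. */
double stage_ratio(stage_time s)
{
    return s.comm / s.calc;
}

/* Ratio of total communication time to total computation time. */
double cumulative_ratio(const stage_time *stages, size_t count)
{
    double comm = 0.0, calc = 0.0;
    for (size_t i = 0; i < count; i++) {
        comm += stages[i].comm;
        calc += stages[i].calc;
    }
    return comm / calc;
}
```

Comparing the per-stage ratios against the cumulative one shows whether communication dominates uniformly or only in particular stages, for example late in a factorization when little arithmetic remains.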

When possible, communications complexity should be analyzed carefully.
For instance, in the Gauss factorization code that is presented in
Appendix F, a C structure is used to relay the owner (node id) of a pivot and the
pivot’s row, column, and value. This structure is 20 bytes of data and we know
the pattern with which these structures are moved about during the course of the
program. It is important to quantify communication like this when possible. The
vague notation should lose significance in the presence of such concrete information.
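A structure of this kind might look like the sketch below. The field names and types are guesses made for illustration (three 4-byte integers and one 8-byte double account for the 20 bytes mentioned above); the actual declaration is in Appendix F and may differ. The traffic function likewise assumes a hypothetical pattern of one copy per receiver per stage.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pivot record: who owns the pivot, where it is, its value. */
typedef struct {
    int32_t owner;  /* node id holding the pivot     */
    int32_t row;    /* row index of the pivot        */
    int32_t col;    /* column index of the pivot     */
    double  value;  /* value of the pivot element    */
} pivot_msg;

/* Bytes of actual data in one record: 3 * 4 + 8 = 20. A compiler may
 * pad the struct itself, so sizeof(pivot_msg) can be larger. */
enum { PIVOT_PAYLOAD_BYTES = 3 * sizeof(int32_t) + sizeof(double) };

/* Total pivot-record traffic if one record reaches each of `receivers`
 * nodes at each of `stages` stages (an assumed pattern). */
size_t pivot_traffic_bytes(size_t stages, size_t receivers)
{
    return stages * receivers * (size_t)PIVOT_PAYLOAD_BYTES;
}
```

On many current compilers `sizeof(pivot_msg)` is 24 rather than 20 because of alignment padding, which is itself a reminder that the number of bytes actually sent should be measured rather than assumed.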

There are other important and related ideas. The frequency and volume
of communications traffic are easy to determine with a high degree of accuracy for
algorithms such as Gauss factorization. Once again, in the presence of this kind
of information, we should dispense with vague concepts. It is useful to consider
something like a pie chart showing the various amounts of time spent on each portion
of the major loop in a program. Indeed, this was a part of the development of the
Gauss code given in this thesis. Tools such as these are important in refining parallel
algorithms and streamlining code.

The parallel program designer must consider many other issues regarding
communications. Graph theory notation is a natural tool. A link-by-link analysis
of the communications over the course of a program is not out of the question
(especially if the communication is merely a repetition of very simple messages). Efficient
use of the topology is important. We should consider the percentage of links used,
balancing of the communications load, frequency of traffic for each link (often the
communication comes in bursts and often between iterations of the basic algorithm),
flow rate (in bytes per second) for each link during the bursts or over longer periods
of time, timelines showing dependencies, and other specific characteristics of
communications. Analysis should be done on a per-stage basis for algorithms that exhibit
iteration (loops).

Perhaps most importantly, a plan for interprocessor communication should
begin well in advance, before the code is ever written. Some reactive work, like
debugging code, is necessary. But a proactive, strong design effort can simplify matters.
The notion of communicating sequential processes (CSP) deserves attention. This
model is due to C. A. R. Hoare [Ref. 30], and it is never far away in the world of
transputers. There is a very close relationship between transputers, Occam (their native
language), and CSP. CSP is a useful paradigm for this sort of (message-passing)
machine. When possible, a problem should be logically separated into processes.
The division of the problem should be natural, so that every process represents a
logical group of tasks. The processes are allowed channels to communicate, and these
channels are implemented either as links in hardware or as buffers in memory (if, for
instance, two processes on the same processor want to communicate).

If  a  problem  is  designed  correctly,  we  should  have  substantial  amounts  of 
work  within  a  process  and  minimal  interprocess  communication.  If  the  processes  and 
channels  are  represented  as  the  nodes  and  edges  of  a  directed  graph,  we  can  make 
use  of  some  nice  tools  and  theorems  from  graph  theory.  For  instance,  we  should  like 
to  maximize  computation  and  minimize  communications.  One  natural  method  is  to 
begin  with  atomic  processes  and  start  to  build. 

Suppose  that  we  have  many  such  processes  (at  least  as  many  as  processors) 
and  we  represent  them  as  the  nodes  of  a  directed  graph.  We  can  assign  the  processes 
(nodes)  a  weight  that  reflects  some  form  of  computational  difficulty.  This  should  be 
a  fairly  concrete  number,  assuming  that  the  task  (process)  is  well-defined.  It  might 
be  the  number  of  flops  per  iteration,  for  example.  Next,  the  channels  should  be 
clearly  indicated  as  weighted,  directed  edges.  The  weight  should  usually  be  a  very 
concrete  number  as  well,  like  the  number  of  bytes  that  passes  along  that  channel 
between  each  stage  of  a  computation. 

This model gives the problem the sort of order that is necessary to keep
the parallel design simple, logical, and formal (i.e., friendly for proof of program
correctness). Once the problem has been expressed in such a manner, there are
many options. For example, we could consider minimum cuts of the flow rates to
decide how to efficiently apportion processes to processors. This mapping alone could
greatly enhance the performance of code.
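The weighted process graph described above is easy to represent, and a candidate mapping of processes to processors can then be scored. The sketch below only evaluates a given mapping (it does not search for a minimum cut); the `channel` record, the function names, and the weights in the usage note are hypothetical.

```c
#include <stddef.h>

/* Directed channel between two processes, weighted by traffic. */
typedef struct {
    size_t from;             /* index of the sending process       */
    size_t to;               /* index of the receiving process     */
    double bytes_per_stage;  /* bytes passed along it each stage   */
} channel;

/* Sum the computational weight (e.g., flops per iteration) that the
 * mapping places on each processor. map[i] is process i's processor. */
void processor_loads(const double *flops, const size_t *map,
                     size_t processes, double *load, size_t processors)
{
    for (size_t p = 0; p < processors; p++)
        load[p] = 0.0;
    for (size_t i = 0; i < processes; i++)
        load[map[i]] += flops[i];
}

/* Bytes per stage that must cross between different processors;
 * channels whose endpoints share a processor cost nothing here. */
double cut_traffic(const channel *channels, size_t count, const size_t *map)
{
    double total = 0.0;
    for (size_t c = 0; c < count; c++)
        if (map[channels[c].from] != map[channels[c].to])
            total += channels[c].bytes_per_stage;
    return total;
}
```

For a chain of four processes with channel weights 100, 5, and 100, placing the first two on one processor and the last two on another cuts only the 5-byte channel, while alternating them cuts all three (205 bytes). A good mapping keeps the cut small without letting the loads drift far apart.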

It seems that much of the work in this area is rather imprecise and generally
unacceptable. Granted, parallel design methodology is a relatively recent problem,
but it can be improved substantially. Good parallel designs that consider these kinds
of issues and express them clearly will likely be in high demand as parallel computing
machinery develops.

C.  PARALLEL  METHODS 


The wide-ranging capabilities of contemporary computing machinery are evident.
An exhaustive list would demand pages, but most readers could readily name
several applications that bear little resemblance to each other. For a single, very
specific machine there is almost no limit to the combinations of sequential instructions
that it may carry out. Put another way, a particular machine can be designed and
built in a few months or years depending upon the level of sophistication involved.
But the different types and purposes of software that may be created to run on that
single machine are nearly limitless. Consider Householder’s comments on the art of
computation [Ref. 17: p. 1]:

If a computation requires more than a very few operations, there are usually
many different possible routines for achieving the same end result. Even so simple
a computation as $ab/c$ can be done $(ab)/c$, $(a/c)b$, or $a(b/c)$, not to mention the
possibility of reversing the order of the factors in the multiplication.
Mathematically these are all equivalent; computationally they are not (cf. §1.2 and …).
Various, and sometimes conflicting, criteria must be applied in the final selection
of a particular routine. If the routine must be given to someone else, or to a
computing machine, it is desirable to have a routine in which the steps are easily laid
out, and this is a serious and important consideration in the use of sequenced
computing machines. Naturally one would like the routine to be as short as possible,
to be self-checking as far as possible, to give results that are at least as accurate as
may be required. And with reference to the last point, one would like the routine to
be such that it is possible to assert with confidence (better yet, with certainty) and
in advance that the results will be as accurate as may be desired, or if an advance
assessment is out of the question, as it often is, one would hope that it can be made
at least upon completion of the computation.

—  ALSTON  S.  HOUSEHOLDER 
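Householder’s point about $ab/c$ is easy to demonstrate in double precision. In the sketch below, which assumes IEEE 754 arithmetic, the function names are hypothetical and the `volatile` intermediates are there only to force each partial result to be rounded to a double before it is used.

```c
/* Computes (a * b) / c: the product is formed first. */
double product_first(double a, double b, double c)
{
    volatile double ab = a * b;   /* may overflow before the division */
    return ab / c;
}

/* Computes a * (b / c): the quotient is formed first. */
double quotient_first(double a, double b, double c)
{
    volatile double bc = b / c;   /* keeps the intermediate moderate  */
    return a * bc;
}
```

With $a = b = c = 10^{200}$ the exact answer is $10^{200}$. The quotient-first ordering returns it, while the product-first ordering overflows to infinity because $10^{400}$ cannot be represented. The orderings are mathematically equivalent and computationally different, exactly as the quotation says.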

Parallel  algorithms  are  combinations  of  sequential  ones,  so  their  complexity 
can  grow  quickly.  In  general,  the  hardware  issues  surrounding  parallel  problems 
are  mature  and  straightforward.  Software,  on  the  other  hand,  is  developing  and 
generally  difficult  to  use. 

In  addition  to  the  familiar  design  considerations  for  a  straightforward  sequential 
algorithm,  the  design  of  a  parallel  solution  must  specify: 

•  An awareness of the interaction between processing and communication.
Frequency and duration (message length) of communications should be known, if
possible. Additionally, we should know how this compares to the frequency
and duration (flops) of computing work.

•  A plan for interprocessor communication, including hardware and software.

•  A  scheme  for  memory  usage. 

•  The  granularity  of  the  problem  (i.e.,  should  the  processors  be  given  larger  or 
smaller  “chunks”  of  work  at  a  time). 

•  Load  balancing  among  several  processors. 

•  A  method  for  accessing  input/output  resources. 

This is a very high level look at the problem. The issue of communications alone
can be more than half of the problem. The simplicity of this short list does not do
the problem justice. Correct execution, as in the sequential case, is very important.
But parallel algorithms are subject to the added scrutiny of performance data (e.g.,
efficiency).

Constructing parallel algorithms is a very creative process,
and there are many questions that can be asked. Is a highly efficient parallel solution
possible, or is the problem bound by dependencies and sequential work? What is
the ratio of time spent communicating to time spent computing? How nearly does
a given algorithm approach the optimal solution? What would happen on some
other number of processors? Are there any bottlenecks that can be eliminated?
Nevertheless, the current performance of parallel machines and the promise of
future architectures are more than adequate motivation to continue developing these
products.

D.  ALGORITHMS 

With the preceding concerns in mind, let us consider the algorithm for Gauss
factorization that was used in this work. The algorithm is given at a very high
level because detail can be gleaned from Chapter V and from the actual code in
Appendix F. The first consideration for GF was “How should the work be distributed?”
There are many options. The matrix could be distributed by rows, or columns, or
blocks. The method chosen in this case was a distribution of the columns of $A$ across
the nodes of the machine. The columns were distributed so that column $j$ went to
processor number $j$ (mod $P$) in a $P$-processor network.
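The cyclic assignment just described can be sketched in a few lines of C. The function names below are hypothetical. The last function counts the columns a node still has to update once stage $r$ has been reached (those with index greater than $r$), which is the quantity that determines how evenly the remaining work is spread.

```c
#include <stddef.h>

/* Node that holds column j under the cyclic (modulus) assignment. */
size_t owner_of_column(size_t j, size_t P)
{
    return j % P;
}

/* Number of the n columns held by node N (0 <= N < P). */
size_t local_column_count(size_t n, size_t P, size_t N)
{
    return n / P + (N < n % P ? 1 : 0);
}

/* Columns with index greater than r that node N holds, i.e. the
 * columns it must still update at stage r. Counted directly. */
size_t active_local_columns(size_t n, size_t P, size_t N, size_t r)
{
    size_t count = 0;
    for (size_t j = r + 1; j < n; j++)
        if (owner_of_column(j, P) == N)
            count++;
    return count;
}
```

For $n = 10$ and $P = 4$, nodes 0 and 1 hold three columns each and nodes 2 and 3 hold two. At every stage the active counts of any two nodes differ by at most one column, which is what keeps the load nearly even until the last few stages.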

Such a distribution scheme seems natural for several reasons. First, the work
associated with the Gauss process moves toward the lower right-hand corner of the
matrix $A \in \Re^{n \times n}$. By using a modulus assignment, and assuming that $n \gg P$, we
have a situation where the load on the processors is nearly balanced for most of the
process. Second, a column-oriented assignment places the pivot column on a single
node at each stage. This makes division by the pivot value a simple task. It is
interesting to note that a similar distribution of $A$ by rows would have merit as well.

Once the matrix has been distributed, the code simply moves, in a synchronized
fashion, from stage to stage of Gauss. At each stage, we must pivot according to
some strategy. Complete pivoting showed especially poor performance since it
involved a great deal of communication and synchronization between stages. The
partial pivoting method allows us to determine in advance which node will have the
pivot, and much less communication is required when this node simply broadcasts the
pivot and pivot column. After the pivot node divides every element under the pivot by the
pivot value, it broadcasts the entire pivot column to every other processor. When the
processors obtain the pivot column, they use the multipliers to perform arithmetic
in the Gauss transform area, and then proceed to the next stage.
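The stage-by-stage procedure described above can be rehearsed on a single machine before any message passing is written. The sketch below is a sequential simulation, not the Appendix F code: the loops over `node` stand in for work that the $P$ nodes would do concurrently, the broadcast is only counted, and the choice to interchange entire rows (including multipliers already stored) is an assumption. All names are hypothetical.

```c
#include <stddef.h>

static double abs_val(double x) { return x < 0.0 ? -x : x; }

/*
 * In-place Gauss factorization with partial pivoting of an n-by-n
 * matrix stored by columns (a[j * n + i] is row i, column j), organized
 * the way P nodes holding columns j with (j mod P) = node would proceed.
 * perm[r] records the pivot row chosen at stage r; *broadcasts counts
 * the simulated pivot-column broadcasts. Returns 0 on success, -1 if a
 * zero pivot is met.
 */
int gfpp_simulated(size_t n, size_t P, double *a,
                   size_t *perm, size_t *broadcasts)
{
    *broadcasts = 0;
    for (size_t r = 0; r < n; r++) {
        double *pivcol = a + r * n;   /* held by node (r mod P) */

        /* Pivot node: search column r, rows r..n-1, for the largest entry. */
        size_t s = r;
        for (size_t i = r + 1; i < n; i++)
            if (abs_val(pivcol[i]) > abs_val(pivcol[s]))
                s = i;
        if (pivcol[s] == 0.0)
            return -1;
        perm[r] = s;

        /* Every node: interchange rows r and s in its local columns. */
        for (size_t node = 0; node < P; node++)
            for (size_t j = node; j < n; j += P) {
                double t = a[j * n + r];
                a[j * n + r] = a[j * n + s];
                a[j * n + s] = t;
            }

        /* Pivot node: form the multipliers, then "broadcast" the column. */
        for (size_t i = r + 1; i < n; i++)
            pivcol[i] /= pivcol[r];
        (*broadcasts)++;

        /* Every node: update its local columns to the right of the pivot. */
        for (size_t node = 0; node < P; node++)
            for (size_t j = node; j < n; j += P)
                if (j > r)
                    for (size_t i = r + 1; i < n; i++)
                        a[j * n + i] -= pivcol[i] * a[j * n + r];
    }
    return 0;
}
```

On return, the strictly lower triangle holds the multipliers ($L$ without its unit diagonal) and the upper triangle holds $R$. Because the inner update loops touch only columns owned by one node, each of them could run on its own processor once the pivot column has arrived.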

The  following  algorithms  give  an  overview  of  the  programs  that  appear  in  Ap¬ 
pendix  F. 

Algorithm 4.1 (Parallel GF: Host) At this level, the host code is essentially the
same for both partial pivoting and complete pivoting. The program is very simple:
distribute the columns, and then accept them back one by one. Let $A \in \mathbb{R}^{n \times n}$ be
the matrix of coefficients, and let P be the number of processors. This algorithm
forms the modified copy of A by overwriting the original copy. After the columns
are returned from the nodes, we have the factored version of A, which can be separated
into L and R in the usual manner.

begin GF (Host)
    for j = 0 : (n - 1)
        send A(:,j) to node (j mod P)
    end for
    for r = 0 : (n - 1)
        receive A(:,r) from node (r mod P)
    end for
end GF (Host)
Algorithm 4.2 (Parallel GFPP: Nodes) Let $A \in \mathbb{R}^{n \times n}$ be the entire matrix
(held at the host). This algorithm is executed on each node in a P-processor network.
Let the node number be N and let $A_N \in \mathbb{R}^{n \times m_N}$ be the local copy of select columns
of the matrix A (where $m_N \approx n/P$ is the number of columns held locally). Let $G_N$
be that part of the Gauss transform area, G, that is held locally. This node receives
every column, j, of A where (j mod P) = N.

begin GFPP (Nodes)
    for j = 0 : (m_N - 1)
        receive column and place in A_N(:,j)
    end for
    for r = 0 : (n - 1)
        if (r mod P) = N  (pivot is held locally)
            perform partial pivoting
            broadcast pivot row index, s, to all nodes
            perform pivot column arithmetic
            broadcast pivot column to all nodes
        else
            receive pivot row index, s, and perform row interchanges
            receive broadcast of pivot column
        end if
        if N = 0
            send pivot column to host
        end if
        perform arithmetic in G_N
    end for
end GFPP (Nodes)
Algorithm 4.3 (Parallel GFPC: Nodes) Let $A \in \mathbb{R}^{n \times n}$ be the entire matrix
(held at the host). This algorithm is executed on each node in a P-processor network.
Let the node number be N and let $A_N \in \mathbb{R}^{n \times m_N}$ be the local copy of select columns
of the matrix A (where $m_N \approx n/P$ is the number of columns held locally). Let $G_N$
be that part of the Gauss transform area, G, that is held locally. This node receives
every column, j, of A where (j mod P) = N.

begin GFPC (Nodes)
    for j = 0 : (m_N - 1)
        receive column and place in A_N(:,j)
    end for
    for r = 0 : (n - 1)
        locate best (local) pivot candidate
        elect pivot (let node N_p hold the winner of the pivot election)
        if (N_p = N)
            broadcast pivot indexes, (s,t), to all nodes
            perform pivot column arithmetic
            broadcast pivot column to all nodes
        else
            receive pivot indexes, (s,t)
            perform permutations
            receive broadcast of pivot column
        end if
        if N = 0
            send pivot column to host
        end if
        perform arithmetic in G_N
    end for
end GFPC (Nodes)

V. IMPLEMENTATION


A.  ENVIRONMENT 

Chapter  IV  introduces  parallel  algorithms  for  Gauss  factorization  (GF).  The 
GF  algorithms  are  produced  for  partial  and  complete  pivoting  strategies.  All  of 
the  programs  associated  with  this  research  are  written  in  parallel  versions  of  the  C 
language  and  executed  on  two  types  of  machines  at  the  U.  S.  Naval  Postgraduate 
School.  The  Math  Department’s  iPSC/2  afforded  eight  of  Intel’s  CX  type  processors 
arranged  in  a  hypercube  topology.  The  Parallel  Command  and  Decision  Systems 
(PARCDS)  Laboratory  in  the  Computer  Science  Department  has  more  than  seventy 
transputers  available  for  the  experiments.  The  discussion  below  gives  a  more  exact 
description  of  the  material  and  equipment  used  in  the  work. 

1.  Hardware 

This  section  describes  the  machines  upon  which  the  work  was  carried  out. 
A  general  knowledge  is  assumed,  including  familiarity  with  the  Intel  80386  micropro¬ 
cessor,  80387  math  coprocessor,  and  INMOS  transputers.  Some  of  this  information 
is  provided  in  Appendix  B. 

The  hardware  used  in  this  research  represents  the  state-of-the-art  for  the 
mid-to-late  1980s.  These  machines  are  quickly  becoming  outdated — fitting  the  his¬ 
tory  of  computing — but  both  INMOS  and  Intel  have  more  recent,  competitive  prod¬ 
ucts  in  today’s  market  and  fine  prospects  for  future  machines.  So,  while  they  are 
a  bit  dated,  the  products  used  in  this  research  represent  important  contemporary 
parallel  architectures. 

Figure 5.1: Hypercube Interconnection Topology: Order $n \le 3$

a.  Networks  of  Transputers 

The majority of the research was performed upon hypercubes of order
$n \in \{0, 1, 2, 3\}$. These are the usual hypercubes (see Appendix C) and each is
embedded in the 3-cube. Figure 5.1 shows this topology. Some of the transputer
work for this thesis was performed by a network of sixteen IMS T800-20 transputers
connected in nearly hypercube fashion (Figure 5.2). This is not identical to the
4-cube, so it will be called the hybrid cube (it is used as a root with two subtrees that
happen to be 3-cubes). The subtrees of the hybrid cube can be distinguished by the
first bit. One of the 3-cubes has labels like 0xxx; the other is labeled 1xxx.

The rationale behind building the hybrid cube is purely practical. The
transputers  have  only  four  links.  Assuming  that  we  define  nodes  of  the  hypercube  to 
be  a  single  transputer,  a  pure  hypercube  of  order  four  would  be  a  closed  interconnec¬ 
tion  scheme  with  no  opportunity  for  input  or  output  to  or  from  the  system.  Here, 
the  root  node  has  been  inserted  between  nodes  zero  (0000)  and  eight  (1000).  While 
this  deals  a  horrible  blow  to  the  elegance  of  hypercube  algorithms — particularly 
communications — it  can  be  used  effectively. 

The hardware for the hybrid hypercube is configured with code by Mike
Esposito [Ref. 31]. This gives us an unlabeled version of the structure that
appears in Figure 5.2. To make use of this configuration, the nodes must be labeled
in a logical fashion. The Gray code (Appendix C) is a reasonable choice for labeling
the nodes. The actual labeling is accomplished by a Network Information File (NIF)
when the transputers are loaded by the Logical Systems C Network Loader,
LD-NET. A more detailed description of this process is contained in the file named
hyprcube.nif in Appendix F.

Networks of transputers use point-to-point communications across
bidirectional links. The links for this work operate at 20 megabits per second
(bidirectionally). That is, ten megabits per second is a peak unidirectional transmission
rate. Current transputer implementations employ a store-and-forward approach to
message passing (see Appendix B) for multi-hop transmissions.

b.  Intel  iPSC/2 

The iPSC/2 used for this research contained eight processors of the
"CX" type (80386/80387 combination). The host is an 80386-based IBM-compatible
personal computer running AT&T UNIX System V (version 3.2). The nodes run a
local subset of UNIX called NX. The host is capable of supporting many users at
once, but each node supports only a single user.

Users can request p nodes, where $p = 2^n$ for $n \in \{0, 1, 2, 3\}$. If another
user does not already have the requested portion of the cube, the request is granted.
As long as nodes remain, another user can access them. For instance, one user could
be working on two nodes and, at the same time, another user could access up to
four others. While the first two users still possessed these six nodes, a third user
could get one or both of the remaining two nodes.

Unlike the transputers, Intel uses a direct-connect circuit switching (see
Appendix B) approach to multi-hop communications. There is an overhead
associated with setting up the path for communication, but this cost is nearly the same
regardless of how many hops the message crosses. Once the circuit is established,
the message can proceed directly from the origin to the destination with negligible
interference from intermediate nodes.

c.  Host  and  Root 

The notion of host is similar on both machines, but there is a slight
difference. The Intel hypercube is directly connected to the host. The transputer
network, however, uses a substantially different protocol than the typical personal
computer. Transputers employ point-to-point serial communications, using an
11-bit link protocol with byte-by-byte acknowledgment. The acknowledge is a two-bit
packet with dual meaning: the receiving transputer has begun to receive the byte,
and it has storage space for another.

In the transputer case, host means the PC. We use the term root
transputer to identify the transputer within the host PC that acts something like a host
to the attached network of transputers. Figure 5.1 illustrates this configuration. An
IMS B004 extension board in the host PC holds a T414 root transputer. The B004
is plugged into the PC's bus and a parallel-serial converter lies between the PC and
the T414. In Figure 5.1 the "host" is a PC and the "root" transputer is the T414.
The iPSC/2 host is simplified, and could almost be thought of as a combination of
the host and root for the transputer case. Since the entire thesis uses the same
programs for both machines, the root and host terminology can become confusing. As it
is not always convenient to express this difference in painstaking detail, I will use the
terms somewhat loosely. An understanding of the differences between the machines
should serve to eliminate confusion in every case. When only one of the terms (host
or root) is needed, I have used the correct term. When both of the terms apply, I
have used them almost interchangeably and they should be interpreted according to
the machine under consideration.

2.  Software 

The  software  for  this  research  was  written  in  the  C  language.  The  Logical 
Systems  C  product  (version  89.1  of  15  January  1990)  was  used  for  the  transputer 
implementation.  For  the  iPSC/2  work,  the  C  compiler  supplied  by  Intel  was  used. 

B.  COMMUNICATIONS  FUNCTIONS 

Prior to implementing the Gauss algorithms, a substantial communications
package was constructed. Most of the code for communications appears in the files
comm.h and comm.c (see Appendix F). As expected, the header file provides
definitions for manifest constants and specifications (declarations) for the functions.
An overview of the functions provided in this file is useful before we discuss the
Gauss code that calls these functions.

The cubecast() function supports broadcasts from the host to all the nodes
of a hypercube. Given a hypercube of order $n \in \{0, 1, 2, 3\}$ with $p = 2^n$ processors,
this communication is completed in n, or $\log_2(p)$, stages. This has some utility
in a 3-cube, but imagine the impact in a 10-cube. All 1,024 processors in the
hypercube would have the message after 10 stages of communication. This function
is especially useful at the beginning of a problem, when data must be shipped to
each of the workers in the network.

Often we need to gather information in the reverse direction, from the workers
back to the root. The coalesce() function is one way to accomplish this task. If no
modification were necessary at intermediate nodes, this operation could be completed
without interference. In the algorithms that I used, however, there was occasion to
modify the information along the way back to the root. For this reason, the gathering
is accomplished using two function calls. First, information is coalesced to a given
node. Upon return from coalesce(), the data exists locally and may be operated
upon. When the data is ready for submission, the submit() function is used to pass
it one step closer to the root.

A modification of the cubecast() function that was useful for the Gauss
problem was cubecast_from(). This function does not assume that the host is the
originator of the broadcast. Instead, the source is specified as the first argument to
this function. The function still performs the broadcast in $\log_2(p)$ stages, but it uses
the concept of a direction to accomplish this.

The concept of directions in the hypercube turns out to be a fairly useful
one. For concreteness, consider the 3-cube shown in Figure C.2. Starting at
any given node, we can specify a direction using one of the three combinations
$d \in \{001, 010, 100\}$. Suppose that the node's label is $\ell$ and let $\oplus$ denote the
exclusive OR operation. Then for some direction, d, the number $(\ell \oplus d)$ is the label of the
node in the direction d from the node $\ell$.

This concept can be applied in general in a hypercube of order n using n-bit
labels for the nodes and some direction d. The possible directions are all the n
combinations of (n - 1) zeros and a single one in an n-bit number. Accordingly,
the code uses directions $d \in \{1, 2, 4, \ldots, 2^{n-1}\}$. In most cases, when a
direction-by-direction approach is desired for all possible directions, we start with one and use
the C left shift operator (<<) to produce the other directions incrementally.

These functions and several others are described in detail in the code of
Appendix F, but these basic ideas give us a reasonably good introduction at a level
that is adequate for understanding the algorithms.

C.  CODE  DESCRIPTIONS 

A  detailed  description  of  the  source  code  used  to  implement  the  algorithms  of 
Chapter  IV  is  given  in  the  header  file  gf.h.  This  header  file,  located  in  Appendix  F,  is 
used  by  both  the  partial  pivoting  and  complete  pivoting  codes.  The  code  for  GF  with 
partial  pivoting  can  be  found  in  gfpphost.c,  the  host  program,  and  gfppnode.c, 
the  node  program.  The  code  for  the  complete  pivoting  algorithm  is  similar  except 
for  the  election  of  pivots,  so  most  of  it  has  been  omitted  in  the  interest  of  saving 
space.  Only  the  elect_next_pivot()  function  remains  because  it  is  the  significant 
difference  between  the  partial  and  complete  pivoting  codes.  This  function  appears 
in  gfpcnode.c. 

VI. RESULTS


A.  GAUSS  WITH  COMPLETE  PIVOTING 

The  host  code,  gfpchost.c,  and  the  node  program,  gfpcnode.c,  are  written 
to  provide  a  parallel  implementation  of  Gauss  Factorization  with  complete  pivoting. 
Since  the  columns  of  A  are  distributed  among  the  nodes  of  the  multiprocessor  system, 
the  selection  of  each  pivot  requires  communication.  The  selection  process,  in  this 
case,  begins  with  each  node  selecting  its  own  best  candidate  for  pivot.  Once  each 
of  the  nodes  has  made  this  choice,  an  election  is  held  to  select  the  best  candidate 
among  all  of  the  nodes. 

Implementation details for the election process are described in the source code,
so a detailed description is not given here. Nevertheless, these results show how
communication, like the election process, can stand in the way of efficient parallel
programming. This program shows how parallel performance can suffer from the effects of
communications. (Recall Fox's $t_{comm}/t_{calc}$ ratio and Seitz's three components of overhead
from Chapter IV.)

The  complete  pivoting  strategy  inserts  inefficient  communications  between  each 
stage  of  the  process.  The  communications  themselves  are  bound  to  be  inefficient  since 
the  election  process  finds  all  nodes  of  an  n-cube  participating  in  an  n-stage  exchange 
of  a  20-byte  structure  (pivot  candidates).  In  addition  to  the  use  of  small  messages, 
the  election  imposes  an  added  measure  of  synchronization  upon  the  problem.  This 
allows  the  processors  less  independence  and  forces  them  to  transition  between  “use¬ 
ful”  program  execution  and  communication  more  frequently.  This  transition  can 
become  burdensome  and  the  processor  can  eventually  find  little  time  to  perform 
calculations. 

In addition to the election process, there is a one-to-all broadcast from the
node holding the pivot to inform the others of the pivot column values. With an
$m \times m$ matrix A, this message is essentially a column of m double precision
floating-point values. Doubles for this implementation were eight bytes each, so this is a
unidirectional broadcast of 8m bytes with exponential fanout.

The election process, as simple as it appears, will prove to be an obstacle
that opposes efficiency. Both the iPSC/2 and transputer systems reward, in terms
of transmission rates, the sender of long messages. Short messages are essentially
penalized by the overhead involved in setting up the transmission line and manager.
Let us consider the results of this complete pivoting strategy. The results from the
iPSC/2 appear first, followed by the transputer results. The largest dimension, n,
that is recorded is n = 176. The iPSC/2 machine would handle larger problems, but
this seemed pointless since the performance appears to approach maximum efficiency
early.

1.  Data  for  the  iPSC/2  System 

Table  6.1  shows  the  timing  data  for  execution  of  Gauss  Factorization  with 
complete  pivoting  on  the  Intel  iPSC/2  system. 

TABLE 6.1: EXECUTION TIMES FOR GF(PC) ON THE iPSC/2

Dimension    Time (seconds) on a Hypercube of Order
   (n)          0          1          2          3
     8      0.126      0.097      0.092      0.155
    16      0.716      0.674      0.608      0.744
    24      2.208      1.751      1.616      1.568
    32      4.627      3.705      3.239      3.149
    40      9.246      6.888      5.895      5.250
    48     14.888     11.479      9.770      9.109
    56     23.686     17.883     15.206     13.796
    64     36.123     26.424     22.326     19.957
    72     49.227     38.178     31.421     28.460
    80     70.546     50.754     42.087     37.810
    88     89.210     69.257     56.803     51.148
    96    115.473     86.760     72.346     63.954
   104    150.915    110.247     91.966     82.680
   112    182.475    138.880    114.486    102.266
   120    224.458    168.056    139.587    123.683
   128    282.491    206.222    170.650    153.379
   136    339.076    248.422    208.745    186.205
   144    385.623    295.217    241.564    217.099
   152    468.763    345.049    281.972    254.538
   160    527.953    404.235    331.653    292.352
   168    636.004    457.089    381.597    338.464
   176    723.596    532.597    449.745    395.008


TABLE 6.2: SPEEDUPS FOR GF(PC) ON THE iPSC/2

Dimension    Speedup on a Hypercube of Order
   (n)            1            2            3
     8  [illegible]        1.373        0.813
    16  [illegible]        1.178        0.962
    24        1.261        1.367        1.408
    32        1.249        1.429        1.470
    40        1.342        1.569        1.761
    48        1.297        1.524        1.635
    56        1.324        1.558        1.717
    64        1.367        1.618        1.810
    72        1.289        1.567        1.730
    80        1.390        1.676        1.866
    88        1.288        1.571        1.744
    96        1.331        1.596        1.806
   104        1.369        1.641        1.825
   112        1.314        1.594        1.784
   120        1.336        1.608        1.815
   128        1.370        1.655        1.842
   136        1.365        1.624        1.821
   144        1.306        1.596        1.776
   152        1.359        1.662        1.842
   160        1.306        1.592        1.806
   168        1.391        1.667        1.879
   176        1.359        1.609        1.832

The speedup data that is shown in Table 6.2 is derived from these execution times.
Speedup was calculated using the usual formula (see Appendix A for details)

$$S_p = \frac{T_1}{T_p}$$

for speedup on p processors.

TABLE 6.3: EFFICIENCIES FOR GF(PC) ON THE iPSC/2

Dimension    Efficiency (percent) on a Hypercube of Order
   (n)          1          2          3
     8     64.948     34.332     10.161
    16     53.155     29.441     12.024
    24     63.068     34.169     17.603
    32     62.451     35.716     18.370
    40     67.122     39.215     22.015
    48     64.852     38.098     20.431
    56     66.225     38.943     21.462
    64     68.354     40.450     22.625
    72     64.470     39.168     21.621
    80     69.498     41.905     23.323
    88     64.405     39.263     21.802
    96     66.548     39.903     22.570
   104     68.444     41.025     22.816
   112     65.695     39.847     22.304
   120     66.781     40.200     22.685
   128     68.492     41.385     23.022
   136     68.246     40.609     22.762
   144     65.312     39.909     22.203
   152     67.927     41.561     23.020
   160     65.303     39.797     22.574
   168     69.571     41.667     23.489
   176     67.931     40.223     22.898

Given the execution times and speedups presented in Tables 6.1 and 6.2, and using
the formula

$$E_p = \frac{S_p}{p} \times 100\%$$

(as defined in Appendix A), we can determine the efficiency of p processors applied
to the Gauss problem. This efficiency data is shown in Table 6.3.

Figure 6.1: Efficiencies for GF (PC) on the iPSC/2

Many different graphical displays of this data would be interesting, but the efficiency
data may be the most interesting since it captures the success or failure of a
parallel program (i.e., poor efficiencies should lead us to question the parallel nature
of the algorithm). Figure 6.1 shows a scatterplot of the data from Table 6.3.

TABLE 6.4: EXECUTION TIMES FOR GF(PC) ON THE TRANSPUTERS

Dimension    Time (seconds) on a Hypercube of Order
   (n)          0          1          2          3          4
     8     0.0083     0.0075     0.0077     0.0088     0.0925
    16     0.0481     0.0392     0.0373     0.0372     0.1236
    24     0.1494     0.1173     0.1063     0.1001     0.1855
    32     0.3417     0.2580     0.2220     0.2132     0.2947
    40     0.6538     0.4922     0.4135     0.3798     0.4587
    48     1.1158     0.8202     0.6934     0.6397     0.7041
    56                1.2950     1.0716     0.9696     1.0239
    64                1.8940     1.5688     1.4046     1.4407
    72                           2.2116     1.9817     1.9808
    80                           2.9560     2.6529     2.6248
    88                           3.9127     3.4812     3.4090
    96                                      4.4808     4.3812
   104                                      5.6442     5.4519
   112                                      7.0388     6.7087
   120                                      8.5430     8.1252
   128                                     10.3300     9.7532
   136                                                11.6930
   144                                                13.6538
   152                                                16.1029
   160                                                18.5176
   168                                                21.4437
   176                                                24.4684
 n_max         48         67         92        128        176

2.  Data  for  the  Transputer  System 


Using the same methods, the timing (Table 6.4), speedup (Table 6.5), and
efficiency (Table 6.6) data for the transputer system is determined. Unfortunately,
the memory limitations of the transputers used for this work prevented comparisons
for large problem size. Empty portions of Table 6.4 signify unavailability of data (i.e.,
execution failure due to inappropriate or excessive problem size). The maximum
problem size that executed successfully for each configuration is listed on the last
line of the table. Figure 6.2 shows a scatterplot of the data from Table 6.6.

TABLE 6.5: SPEEDUPS FOR GF(PC) ON THE TRANSPUTERS

Dimension    Speedup on a Hypercube of Order
   (n)            1            2            3            4
     8  [illegible]        1.074        0.942        0.090
    16  [illegible]        1.288        1.290        0.389
    24  [illegible]        1.405        1.493        0.805
    32  [illegible]        1.539        1.602        1.159
    40        1.328        1.581        1.721        1.425
    48        1.360        1.609        1.744        1.585
    56        1.363        1.648        1.821        1.724
    64        1.389        1.677        1.872        1.826
    72                     1.691        1.887        1.888
    80                     1.734        1.932        1.953
    88                     1.743        1.959        2.001
    96                                  1.975        2.020
   104                                  1.993        2.064
   112                                  1.996        2.094
   120                                  2.022        2.126
   128                                  2.030        2.150
   136                                               2.150
   144                                               2.186
   152                                               2.180
   160                                               2.207
   168                                               2.210
   176                                               2.227


TABLE 6.6: EFFICIENCIES FOR GF(PC) ON THE TRANSPUTERS

Dimension    Efficiency (percent) on a Hypercube of Order
   (n)          1          2          3          4
     8     55.556     26.860     11.775      1.125
    16     61.356     32.204     16.130      2.431
    24     63.693     35.133     18.662      5.034
    32     66.224     38.477     20.029      7.246
    40     66.409     39.526     21.514      8.908
    48     68.017     40.230     21.803      9.905
    56     68.167     41.190     22.760     10.776
    64     69.431     41.913     23.406     11.410
    72                42.279     23.592     11.801
    80                43.358     24.155     12.207
    88                43.575     24.488     12.504
    96                           24.691     12.626
   104                           24.916     12.897
   112                           24.948     13.088
   120                           25.279     13.289
   128                           25.369     13.435
   136                                      13.440
   144                                      13.662
   152                                      13.623
   160                                      13.795
   168                                      13.812
   176                                      13.917

Figure 6.2: Efficiencies for GF (PC) on Transputers

B. GAUSS WITH PARTIAL PIVOTING


1.  Data  for  the  iPSC/2  System 


Table 6.7 shows the timing data for execution of the Gauss Factorization
(partial pivoting) codes (gfpphost.c and gfppnode.c) on the Intel iPSC/2 system.
The speedup data that is shown in Table 6.8 is derived from these execution times.
Speedup was calculated using the usual formula (see Appendix A for details)

$$S_p = \frac{T_1}{T_p}$$

for speedup on p processors. Given the execution times and speedups presented in
Tables 6.7 and 6.8, and using the formula

$$E_p = \frac{S_p}{p} \times 100\%$$

(as defined in Appendix A), we can determine the effectiveness (efficiency) of p
processors applied to the Gauss problem. This efficiency data is shown in Table 6.9.

TABLE 6.7: EXECUTION TIMES FOR GF(PP) ON THE iPSC/2

Dimension    Time (seconds) on a Hypercube of Order
   (n)          0          1          2          3
     8      0.109      0.130      0.127      0.155
    16      0.371      0.359      0.394      0.493
    24      0.508      0.489      0.519      0.624
    32      0.752      0.673      0.675      0.782
    40      1.055      0.880      0.834      0.911
    48      1.499      1.144      1.024      1.067
    56      2.019      1.473      1.248      1.228
    64      2.733      1.878      1.491      1.402
    72      3.646      2.412      1.872      1.721
    80      4.743      3.040      2.256      1.989
    88      6.053      3.719      2.644      2.237
    96      7.567      4.547      3.125      2.560
   104      9.431      5.477      3.698      2.912
   112     11.468      6.561      4.252      3.237
   120     13.847      7.859      4.933      3.646
   128     16.552      9.211      5.661      4.070
   136     19.619     10.873      6.590      4.633
   144     23.071     12.632      7.532      5.170
   152     26.982     14.681      8.940      5.866
   160     31.204     16.869      9.866      6.539
   168     35.865     19.318     11.143      7.284
   176     41.064     21.990     12.605      8.084
   200     59.453     31.437     17.598     10.910
   225     83.962     44.076     24.329     14.701
   250    114.319     59.515     32.410     19.118
   275    151.443     78.652     42.336     24.512
   300    195.822    102.589     54.138     30.927
   325    248.153    127.840     68.082     38.418
   350    309.241    158.859     84.072     46.978
   375    379.538    194.599    101.984     56.280
   400    459.740    235.259    122.946     67.366
   425    550.536    281.312    147.058     80.439
   450    653.070    333.180    173.748     94.656
   475    767.616    391.136    203.513    110.243
   500    894.705    455.308    236.483    127.631

TABLE 6.8: SPEEDUPS FOR GF(PP) ON THE iPSC/2


TABLE 6.9: EFFICIENCIES FOR GF(PP) ON THE iPSC/2

Dimension    Efficiency (percent) on a Hypercube of Order
   (n)          1          2          3
     8     42.085     21.499      8.803
    16     51.743     23.526      9.416
    24     51.943     24.470     10.174
    32     55.911     27.842     12.019
    40     59.943     31.615     14.472
    48     65.544     36.615     17.563
    56     68.557     40.453     20.560
    64     72.764     45.825     24.365
    72     75.580     48.698     26.482
    80     78.023     52.554     29.804
    88     81.390     57.228     33.821
    96     83.218     60.541     36.955
   104     86.104     63.762     40.482
   112     87.402     67.427     44.287
   120     88.096     70.175     47.475
   128     89.849     73.097     50.832
   136     90.219     74.430     52.934
   144     91.323     76.577     55.781
   152     91.897     75.451     57.497
   160     92.492     79.072     59.651
   168     92.830     80.469     61.544
   176     93.372     81.442     63.498
   200     94.559     84.462     68.115
   225     95.247     86.278     71.393
   250     96.042     88.181     74.744
   275     96.274     89.430     77.230
   300     95.440     90.427     79.147
   325     97.056     91.123     80.742
   350     97.332     91.958     82.283
   375     97.518     93.039     84.297
   400     97.709     93.484     85.307
   425     97.851     93.591     85.552
   450     98.006     93.968     86.243
   475     98.127     94.296     87.037
   500     98.253     94.584     87.626

Figure 6.3: Efficiencies for GF (PP) on the iPSC/2

Here, again, only the efficiency is plotted. Figure 6.3 shows a scatterplot of the data
from Table 6.9.

2. Data for the Transputer System

Using the same methods, the timing (Table 6.10), speedup (Table 6.11), and
efficiency (Table 6.12) data for the transputer system is determined. Unfortunately,
the memory limitations of the transputers (32 kilobytes per node) used for this
work prevented comparisons for large (interesting) problem size. Empty portions of
Table 6.10 signify unavailability of data (i.e., execution failure due to inappropriate
or excessive problem size). The maximum problem size that executed successfully
for each configuration is listed on the last line of Table 6.10. The minimum problem
size for the hybrid cube on 16 processors was one where the dimension of A was
n = 16.

TABLE 6.10: EXECUTION TIMES FOR GF(PP) ON THE TRANSPUTERS

Dimension    Time (seconds) on a Hypercube of Order
   (n)            0            1            2            3            4
     8       0.0906       0.0904       0.0906  [illegible]
    16       0.1126       0.1101       0.1102  [illegible]       0.1092
    24       0.1582       0.1480       0.1462  [illegible]       0.1439
    32       0.2312       0.2038       0.1965  [illegible]       0.1889
    40       0.3360       0.2765       0.2568  [illegible]       0.2446
    48                    0.3782       0.3402  [illegible]       0.3149
    56                    0.5124       0.4463       0.4258       0.4064
    64                    0.6911       0.5863       0.5505       0.5196
    72                                 0.7277       0.6715       0.6308
    80                                 0.8976       0.8147       0.7560
    88                                 1.0675       0.9482       0.8732
    96                                         [illegible]       1.0581
   104                                         [illegible]       1.2430
   112                                              1.6129       1.4551
   120                                              1.8388       1.6490
   128                                                           1.8585
   136                                                           2.1306
   144                                                           2.3606
   152                                                           2.6717
   160                                                           2.9846
   168                                                           3.2910
   176                                                           3.6606
 n_max           47           66           92  [illegible]          176

TABLE 6.11: SPEEDUPS FOR GF(PP) ON THE TRANSPUTERS

Dimension    Speedup on a Hypercube of Order
   (n)            1            2            3            4
     8  [illegible]        1.000        0.997
    16  [illegible]        1.022        1.017        1.031
    24  [illegible]        1.082        1.083        1.099
    32  [illegible]        1.177        1.184        1.224
    40  [illegible]        1.308        1.333        1.374
    48  [illegible]        1.447        1.493        1.563
    56        1.387        1.592        1.669        1.748
    64        1.448        1.707        1.818        1.926
    72                     1.888        2.046        2.178
    80                     2.049        2.258        2.433
    88                     2.256        2.539        2.758
    96                                  2.667        2.920
   104                                  2.853        3.134
   112                                  2.998        3.323
   120                                  3.219        3.590
   128                                               3.852
   136                                               4.019
   144                                               4.296
   152                                               4.456
   160                                               4.646
   168                                               4.871
   176                                               5.031
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TABLE  6.12:  EFFICIENCIES  FOR  GF(PP)  ON  THE  TRANSPUTERS 


Dimension     Efficiency (percent) on a Hypercube of Order
   (n)            1            2            3            4
     8       50.111       25.000       12.459
    16       51.135       25.544       12.715        6.445
    24       53.446       27.052       13.535        6.871
    32       56.722       29.415       14.805        7.650
    40       60.759       32.710       16.667        8.585
    48       65.090       36.180       18.666        9.772
    56       69.334       39.801       20.859       10.927
    64       72.412       42.678       22.727       12.039
    72                    47.193       25.571       13.611
    80                    51.228       28.220       15.206
    88                    56.392       31.744       17.235
    96                                 33.343       18.252
   104                                 35.657       19.589
   112                                 37.475       20.770
   120                                 40.241       22.436
   128                                              24.073
   136                                              25.116
   144                                              26.849
   152                                              27.850
   160                                              29.036
   168                                              30.447
   176                                              31.441

[Plot; vertical axis: Efficiency (percent)]

Figure 6.4: Efficiencies for GF(PP) on Transputers

Figure 6.4 shows a scatterplot of the data from Table 6.12.
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VII.  CONCLUSIONS 


I  value  the  discovery  of  a  single  even  insignificant  truth  more  highly  than  all 
the  argumentation  on  the  highest  questions  which  fails  to  reach  a  truth. 

—  GALILEO  (1564-1642) 

A.  SIGNIFICANCE  OF  THE  RESULTS 
1.  Communications  and  Computation 

Perhaps one of the most obvious effects in the results of Chapter VI is the
abysmal performance of the complete pivoting code when compared to the partial
pivoting implementation. The relatively small amount of extra communication
required for the complete pivoting algorithm seems to force synchronization
delays, thus reducing the system’s performance. This demonstrates how critical
it is to balance communication with calculation in parallel processing. The
conclusion, for this problem, is that parallel designs must minimize the frequency of
synchronizing events and minimize the communication volume on occasions when
communication is necessary. The greater the amount of uninterrupted work that a
processor can accomplish, the better. While control (i.e., blocking communications,
synchronization, and loop-by-loop data distribution) is necessary, it will have adverse
impacts on performance. The individual processors of a multiprocessor system should
be granted the maximum degree of independence that the mission will allow.

While  there  is  undoubtedly  some  room  for  improvement  in  the  complete 
pivoting  code,  it  would  appear  that  maximum  efficiencies  of  approximately  22%, 
40%,  and  70%  for  hypercubes  of  order  three,  two,  and  one,  respectively,  are  likely  on 
the  iPSC/2.  The  same  code  seems  to  be  headed  for  somewhat  better  performance 
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on  the  transputers,  but  with  the  shortage  of  memory,  it  is  difficult  to  extrapolate 
and  determine  the  direction  of  the  plots.  The  higher  order  cubes  appear  to  flatten 
at  about  the  same  efficiency  that  the  iPSC/2  showed  as  a  terminal  efficiency. 

The partial pivoting code, on the other hand, exhibits the kind of characteristics
that we like to see in parallel code. Both systems show efficiencies rising
sharply (again, the size limit for the transputers is unfortunate) and the iPSC/2
shows some very nice results as the dimension of the matrix exceeds about 250.

B.  THE  TERAFLOP  RACE 

One of the biggest challenges to parallel computing today can be found in the
“teraflop race”. There are at least three competitors with teraflop initiatives: the
United States, Europe, and Japan. The United States effort centers around Intel
with projects like Touchstone (Chapter I). The European effort relies on the T9000
transputer. Considering the three- to five-year-old technology used for this research,
together with the numbers that the various parallel computer designers boast today,
it seems that we might see teraflop performance by the mid-1990s. C. Gordon Bell
claims that the teraflop is conceivable [Ref. 6: p. 1099]:

Two  relatively  simple  and  sure  paths  exist  for  building  a  system  that  could 
deliver on the order of 1 teraflop by 1995. They are: (1) A 4K node multicomputer
with  800  gigaflops  peak  or  a  32K  node  multicomputer  with  1.5  teraflops.  (2)  A 
Connection  Machine  with  more  than  one  teraflop  and  several  million  processing 
elements. 

Current products suggest that INMOS and Intel will be among the most likely
competitors. Table 7.1, adapted from Jack Dongarra’s report [Ref. 8: p. 20], shows
how transputer-based systems compare to Intel products. This table summarizes a
test involving the solution of a 1000 x 1000 system of linear equations. The
processors used for my thesis show floating-point capabilities of 0.37 Mflops (T800-20) and
0.16 Mflops (Compaq 386/20 with 80387) in Dongarra’s report [Ref. 8: pp. 14, 16].
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TABLE  7.1:  PARALLEL  MACHINE  COMPARISON 


Computer             T_1            n      T_n   Speedup   Efficiency
Parsytec FT-400     1075  [illegible]     4.90     219.0      .55
Parsytec FT-400     1075          256     6.59     163.0      .64
Parsytec FT-400     1075  [illegible]    13.20      81.4      .81
Parsytec FT-400     1075           64    19.10      56.3      .88
Parsytec FT-400     1075           16    69.20      15.5      .97
Intel iPSC/860        59           32     5.30      11.0      .34
Intel iPSC/860        59           16     6.80       8.7      .54
Intel iPSC/860        59            8    10.60       5.6      .70

The iPSC/860 illustrates the most recent technology and shows excellent uniprocessor
performance (6.5 Mflops) [Ref. 8: p. 9]. The T800 transputer that Parsytec
used is somewhat dated and will soon be replaced by the T9000. Nevertheless, the
transputer-based system shows good parallel performance. The times of execution in
the experiments of this thesis also indicate that the T800 is faster for floating-point
calculations than the 386/387 combination in the iPSC/2.


C.  FURTHER  WORK 


My research suggests many areas for further investigation. The method of
conjugate  gradients  shows  a  great  deal  of  promise  as  a  candidate  for  parallelization. 
Indeed,  it  was  the  original  aim  of  this  thesis,  but  the  development  of  other  portions  of 
the  code  required  a  great  deal  of  time.  The  parallel  CG  algorithm  should  be  relatively 
simple  to  code  and  holds  great  potential  with  respect  to  performance.  Additionally, 
it  possesses  a  nontrivial  derivation  and  the  theory  behind  the  algorithm  would  be 
interesting  to  develop. 

There  are  many  other  variations  on  Gauss  factorization  that  could  be  coded 
and  tested.  While  the  programs  presented  in  this  thesis  are  designed  in  an  effort 
to  produce  efficient  performance,  there  is  undoubtedly  much  that  might  be  done  to 
enhance  this  code.  Among  the  options:  at  a  very  basic  level,  we  could  begin  with 
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other  distributions  of  the  matrix  A.  A  block  method  or  row  method  may  actually 
yield  better  performance.  As  the  LINPACK  benchmarks  seem  to  use  blocks,  this  is 
probably  worth  pursuing. 

General purpose parallel computing, the ability to rely on parallel architectures
for general purpose computation without requiring the investigator to be more
concerned with the architecture than with the problem being computed, still requires much
work. The ability to use parallel architectures as a computational tool to solve
problems will mark an increasing maturity in this field.

Applying  object-oriented  design  and  programming  paradigms  to  the  parallel 
world  may  hold  a  great  deal  of  promise.  In  particular,  the  language  seems  to 

be  a  prudent  choice  for  parallel  programming. 

In addition to the more practical options, the study of parallel theory and
algorithms seems interesting and shows a great need for development. In particular,
this field seems to need a more-or-less general (at least for MIMD machines)
approach to classifying parallel algorithms and specifying their performance. As noted
in Chapter IV, a mixture of this field with graph theory may hold a great deal of
promise.

At first glance, the use of the Ada programming language with its built-in
tasking constructs might seem optimal for the type of computing investigated in
this thesis. Ada, in this regard, however, is optimized for use with shared memory
multiprocessors. The use of Ada on transputers still requires much experimentation
and better tools. Presently only one, rather expensive, Ada compiler is available for
transputer use. Its required use of occam harnesses makes using Ada on transputers
awkward at best. Further research is needed to create a better environment for Ada
programming on transputers. Given the significance of Ada to the DoD establishment,
this should become a priority. The inclusion of a standard math package and
the advent of Ada 9X may hold some promise in this regard.
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APPENDIX  A 


NOTATION  AND  TERMINOLOGY 


This appendix explains the shorthand used in the rest of the thesis. Conventions,
by definition, are generally accepted rules of the business. This would
seem to obviate the need for further discussion of conventions, but there are several
good reasons for discussing notation and terminology. First, the notation may
not be conventional. In the absence of convention (or when the foundation that it
provides is inadequate) a more substantial agreement is required. Second, even for
conventional notation, the audience may be diverse enough to warrant familiarization.
The following discussion provides this familiarity and gives the terms of an
agreement to establish the meaning of the words and symbols used in the rest of
the work. On occasion, neither convention nor this agreement will suffice. These
situations will be handled case-by-case with the philosophy that clarity should
never be sacrificed for brevity.

A.  BASICS 

Most of the work deals with the integers, $Z$ (from the German word for numbers,
Zahlen), the set of real numbers, $R$, and the complex numbers, $C$. Often, the
German $\Re$ is used to represent the reals. A complex number is a number,
$x + iy = z \in C$, that has a real part ($x \in \Re$) and an imaginary part
($y \in \Re$), with the complex unit $i = \sqrt{-1}$. Sometimes the real part is
denoted $\mathrm{Re}(z)$ and $\mathrm{Im}(z)$ is used to represent the imaginary part.

A scalar is simply a real number, and is usually denoted by a lower-case Greek
letter.¹ A vector is an ordered set of scalars. Lower-case Latin letters like $b$, $x$, and
$y$ are used to denote vectors. Sometimes an arrow is placed above the name of a
vector, like $\vec{x}$, to emphasize the fact that it is a vector.


¹The Greek alphabet is shown in the Table of Symbols.

Matrices are two dimensional and usually contain real or complex elements.
Capital letters (Greek or Latin) are used to represent matrices. Common examples
include $A$, $P$, $Q$, $R$, $\Lambda$, and $\Sigma$.

The  number  systems  introduced  above  cannot  be  represented  in  a  finite  space. 
There  are  two  basic  problems.  First,  we  should  consider  the  size  (or  cardinality)  of 
the sets. The integers are countable or denumerable since there exists a one-to-one
mapping  between  Z  and  the  natural  numbers,  N.  This  is  an  advantage  in  finite 
storage  since  it  means  that  we  can  choose  a  finite  range  of  the  integers  and  be  quite 
certain  that  every  integer  in  that  range  is  represented  (exactly).  Even  though  Z  is 
denumerable,  it  is  a  set  with  infinite  cardinality. 

The real numbers present a more difficult situation for finite storage. The real
number line is dense in comparison to the integers. $\Re$ is not only an infinite set, it is
not countable (i.e., $\Re$ is uncountable). It is said to have the power of the continuum.
To represent a real number, $x$, we use the floating-point approximation, $fl(x)$, to $x$.
This is a number that may be described by three parts: the sign $s$, the exponent $e$,
and the mantissa $d$. An illustration of such a number is provided in Chapter II.
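
As a rough illustration of the three parts named above, the standard C function frexp() splits a double into a fraction and a power-of-two exponent. This is only a sketch: the type fl_parts_t and the function fl_parts() are hypothetical names, and the binary fraction returned by frexp() merely stands in for the mantissa $d$; the layout illustrated in Chapter II may differ.

```c
#include <math.h>

/* Hypothetical container for the three parts of fl(x). */
typedef struct {
    int    sign;      /* +1 or -1 */
    int    exponent;  /* power of two, e */
    double mantissa;  /* fraction d, with 0.5 <= d < 1 for nonzero x */
} fl_parts_t;

/* Split x so that x == sign * mantissa * 2^exponent. */
fl_parts_t fl_parts(double x)
{
    fl_parts_t p;
    p.sign = signbit(x) ? -1 : 1;
    p.mantissa = frexp(fabs(x), &p.exponent);
    return p;
}
```

For example, fl_parts(-6.0) yields a sign of -1, a mantissa of 0.75, and an exponent of 3, since 6 = 0.75 x 2^3.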

B.  COMPLEX  NUMBERS 
1.  Notation 

The previous section introduced one notation for complex numbers; namely,
$z = x + iy$. There are several other representations, each of which makes its own
contribution in practical use. Electrical engineers usually replace the $i$ with $j$ since $i$
is used to represent electrical current. Since the complex number can be represented
by an ordered pair of real numbers, the graphical notation of Figure A.1 is natural.
In this plane, the real and imaginary axes are used to represent the components of
a complex number.

Figure A.1: The Complex Plane


The vector sum of these two parts, $\vec{z} = \vec{x} + \vec{y}$, is an equivalent and useful
way to model complex numbers. There is yet another way to describe $z$. Let $r$ be the
magnitude of the vector $z$ and let $\theta$ be the angle measured from the positive real axis
counter-clockwise to $z$. Using this notation, we could use trigonometry to describe
the complex number as $z = r(\cos\theta + i\sin\theta)$. The Euler formula [Ref. 32: p. 74],

\[ e^{z} = e^{x+iy} = e^{x}e^{iy} = e^{x}(\cos y + i\sin y), \tag{A.1} \]

can be used to convert a complex number to yet another form: $z = re^{i\theta}$.
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2.  Operations 


a.  Addition  and  Subtraction 


Addition and subtraction of complex numbers is performed in the same
manner that vectors are added or subtracted. For instance, let $z_1 = a + ib$ and let
$z_2 = c - id$. Then the sum, $z_1 + z_2$, is the same as the sum of the corresponding
vectors:

\[ z_1 + z_2 = \begin{bmatrix} a + c \\ b - d \end{bmatrix} \tag{A.2} \]

so the sum is $z_1 + z_2 = (a + c) + i(b - d)$. Differences are handled in the obvious way,
as vector differences.


b.  Multiplication 

Multiplication is performed by applying high school algebra. For the
same complex numbers $z_1$ and $z_2$:

\[ z_1 \times z_2 = (a + ib)(c - id) = ac - (a)(id) + (ib)(c) - (ib)(id) \tag{A.3} \]

and using the definition of the complex unit, $i = \sqrt{-1}$, we may combine the middle
terms and move the $i^2 = -1$ outside the last term to find the (complex) product:

\[ z_1 \times z_2 = ac - i(ad - bc) + bd = (ac + bd) - i(ad - bc) \tag{A.4} \]

c.  Conjugation 

The complex conjugate of a complex number $z = x + iy$ is defined as
$\bar{z} = x - iy$. This simple operation finds practical application in complex division.
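
The three operations described so far can be sketched in a few lines of C. The type cplx and the functions cadd(), cmul(), and cconj() are hypothetical names chosen for this illustration; the thesis's own complex.h may define these operations differently.

```c
/* Hypothetical complex type for illustration only. */
typedef struct { double re, im; } cplx;

/* Sum: add real parts and imaginary parts, as for vectors. */
cplx cadd(cplx z1, cplx z2)
{
    cplx s = { z1.re + z2.re, z1.im + z2.im };
    return s;
}

/* Product: expand (a + ib)(c + id) and apply i^2 = -1. */
cplx cmul(cplx z1, cplx z2)
{
    cplx p;
    p.re = z1.re * z2.re - z1.im * z2.im;
    p.im = z1.re * z2.im + z1.im * z2.re;
    return p;
}

/* Conjugate: negate the imaginary part. */
cplx cconj(cplx z)
{
    cplx c = { z.re, -z.im };
    return c;
}
```

With $z_1 = 1 + 2i$ and $z_2 = 3 - 4i$ (that is, $a = 1$, $b = 2$, $c = 3$, $d = 4$), the sum is $4 - 2i$ and the product is $11 + 2i$, which agrees with $(ac + bd) - i(ad - bc)$.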

d.  Division 

Consider the quotient ($z_1/z_2$) of the same complex numbers that were
used in equations A.2, A.3, and A.4. If we multiply both the numerator and the
denominator by the complex conjugate of the denominator, $\bar{z}_2$, we have:

\[ \frac{z_1}{z_2} = \frac{a + ib}{c - id} = \frac{(a + ib)(c + id)}{(c - id)(c + id)} = \frac{ac + i(ad) + i(bc) + i^2(bd)}{c^2 - i^2 d^2} \tag{A.5} \]

and then, by applying $i^2 = -1$, we conclude:

\[ \frac{z_1}{z_2} = \frac{ac - bd + i(bc + ad)}{c^2 + d^2} = \frac{ac - bd}{c^2 + d^2} + i\,\frac{bc + ad}{c^2 + d^2} \tag{A.6} \]

As a practical matter, this is not the way we would compute a complex quotient.
The code given in Appendix F (function cdiv() in complex.h) provides a method
that is better suited to the finite precision environment.
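
One well-known alternative for finite precision is Smith's method, which scales by the larger component of the denominator so that the sum of squares is never formed directly. The sketch below shows that technique; it is an assumption that it resembles cdiv() in Appendix F, and the names cplx and cdiv_smith() are hypothetical. A zero denominator is not handled.

```c
#include <math.h>

/* Hypothetical complex type for illustration only. */
typedef struct { double re, im; } cplx;

/* Smith's method for num / den with num = a + ib and den = c + id.
   The ratio r has magnitude at most 1, so c*c + d*d is never computed
   and intermediate overflow is much less likely. */
cplx cdiv_smith(cplx num, cplx den)
{
    cplx q;
    if (fabs(den.re) >= fabs(den.im)) {
        double r = den.im / den.re;       /* d / c */
        double t = den.re + den.im * r;   /* (c^2 + d^2) / c */
        q.re = (num.re + num.im * r) / t;
        q.im = (num.im - num.re * r) / t;
    } else {
        double r = den.re / den.im;       /* c / d */
        double t = den.re * r + den.im;   /* (c^2 + d^2) / d */
        q.re = (num.re * r + num.im) / t;
        q.im = (num.im * r - num.re) / t;
    }
    return q;
}
```

For $z_1 = 1 + 2i$ and $z_2 = 3 - 4i$ the quotient is $-0.2 + 0.4i$, matching equation A.6 with $a = 1$, $b = 2$, $c = 3$, $d = 4$. The same routine also divides $10^{200}(1 + i)$ by itself without overflow, a case in which forming $c^2 + d^2$ directly would overflow in double precision.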


C.  VECTORS  AND  MATRICES 
1.  Columns  and  Rows 


Vectors are ordered collections of scalars represented as columns. Let
$\alpha, \beta, \gamma \in C$ with $\alpha = 1.0 + i4.0$, $\beta = 2.0 - i5.0$, and $\gamma = 3.0 + i6.0$. Then:

\[ x = \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 \\ 2.0 - i5.0 \\ 3.0 + i6.0 \end{bmatrix} \]

If row-orientation is intended the transpose is used:

\[ x^T = [\ \alpha \ \ \beta \ \ \gamma \ ] = [\ (1.0 + i4.0) \ \ (2.0 - i5.0) \ \ (3.0 + i6.0) \ ] \]

Matrices may be formed as ordered combinations of elements, vectors, or blocks.
Suppose that $\mu = 3.0$ and $\nu = 7.0$. Then, with $x$ as given above, the following
matrices are equivalent:

\[ A = [\ x \ \ \mu x \ \ \nu x \ ] = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 & 7.0 + i28.0 \\ 2.0 - i5.0 & 6.0 - i15.0 & 14.0 - i35.0 \\ 3.0 + i6.0 & 9.0 + i18.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.7} \]

An element within a matrix is usually denoted $A(i,j)$, where $i$ is the row index and
$j$ is the column index. For instance, $A(1,3) = 7.0 + i28.0$ in (A.7).
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A  block  of  the  matrix  A  is  a  rectangular  matrix  B  within  A.  MATLAB 
notation is useful. For instance, $B = A(i:j, k:l)$ means that $B$ is the block of $A$’s
rows $i$ through $j$ and columns $k$ through $l$. The row or column index “:” means all rows or
all columns. For instance:

\[ B = A(:,1:2) = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 \\ 2.0 - i5.0 & 6.0 - i15.0 \\ 3.0 + i6.0 & 9.0 + i18.0 \end{bmatrix} \tag{A.8} \]

As  a  sidenote,  a  number  with  a  decimal  point  should  usually  be  taken  as 
a  real  number.  Mathematically  speaking,  1  =  1.0.  But  many  compilers  treat  1 
as  an  integer  and  use  the  decimal  point  to  recognize  1.0  as  a  floating-point  value. 
Therefore,  all  of  the  code  associated  with  this  work  and  most  of  the  examples  use 
the  decimal  point  as  a  clue  that  the  number  is  a  real  number  or  its  floating-point 
approximation. 
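
The distinction matters in C, where the quotient of two integer constants is itself an integer. A minimal sketch (the function names are hypothetical):

```c
/* 1 / 2 is integer division: the quotient is 0, then converted to 0.0. */
double half_int(void)
{
    return 1 / 2;
}

/* 1.0 / 2.0 is floating-point division: the quotient is 0.5. */
double half_real(void)
{
    return 1.0 / 2.0;
}
```

Writing the decimal point therefore guards against an accidental integer quotient as well as serving as a visual clue.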

2.  Conjugation  and  Transposition 

The conjugate of a vector or matrix is simply a vector or matrix whose
entries are the conjugates of the original entries. A superscript C is used to denote
the conjugate of a vector or matrix. For instance, with $A$ as given in (A.7),

\[ A^C = \begin{bmatrix} 1.0 - i4.0 & 3.0 - i12.0 & 7.0 - i28.0 \\ 2.0 + i5.0 & 6.0 + i15.0 & 14.0 + i35.0 \\ 3.0 - i6.0 & 9.0 - i18.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.9} \]

The transpose of a vector or matrix, denoted with a superscript T, refers to
a transposition of its rows and columns. With $A \in C^{m \times n}$, the effect of transposition
is that $A(i,j) = A^T(j,i)$ for all $i$ such that $1 \le i \le m$, and all $j$ so that $1 \le j \le n$.
For example, consider the transposition of the matrix $A$ that is found in equation A.7.

\[ A^T = \begin{bmatrix} x^T \\ \mu x^T \\ \nu x^T \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 & 2.0 - i5.0 & 3.0 + i6.0 \\ 3.0 + i12.0 & 6.0 - i15.0 & 9.0 + i18.0 \\ 7.0 + i28.0 & 14.0 - i35.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.10} \]
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In  this  example  we  see  that  the  columns  of  a  matrix  become  the  rows  of  its  transpose. 
This example also demonstrates that when we first transpose, and then stack the
columns of a matrix, we arrive at the transpose of the matrix. In the event that
$A = A^T$ we say that $A$ is symmetric.

The conjugate (or Hermitian) transpose of $A$ is $A^H$. This matrix is the
result of combining the conjugation and transposition operations on $A$. The following
example shows the Hermitian transpose of $A$:

\[ A^H = \begin{bmatrix} 1.0 - i4.0 & 2.0 + i5.0 & 3.0 - i6.0 \\ 3.0 - i12.0 & 6.0 + i15.0 & 9.0 - i18.0 \\ 7.0 - i28.0 & 14.0 + i35.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.11} \]

If $A = A^H$, we say that “$A$ is Hermitian.” We should never confuse “$A$ is Hermitian”
with “$A$ Hermitian” (the conjugate transpose, $A^H$, of $A$). [Ref. 33: p. 294]
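
The Hermitian transpose is easy to state in code. The sketch below uses the standard C99 header <complex.h> (not the thesis's own complex.h); the macro DIM and the function hermitian_transpose() are hypothetical names, and the fixed 3 x 3 size is chosen only to match the example.

```c
#include <complex.h>

#define DIM 3

/* Form H = A^H: entry H(j,i) is the conjugate of A(i,j). */
void hermitian_transpose(double complex a[DIM][DIM], double complex h[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            h[j][i] = conj(a[i][j]);
}
```

Note that C arrays are indexed from zero, so the entry written $A(1,3)$ in the text is a[0][2] here, and its conjugate lands in h[2][0].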

3.  Zeros 

It  could  be  argued  that  zero  is  the  most  important  number.  In  addition  to 
its  use  as  a  number,  zero  is  also  used  to  represent  a  vector  or  matrix  in  which  every 
element  is  equal  to  zero.  In  the  (extremely  rare)  event  that  the  context  does  not 
clearly  indicate  the  size  of  a  “0-vector”  or  “0-matrix”,  its  size  will  be  given  explicitly. 
In  the  absence  of  implied  or  specified  size,  0  should  be  interpreted  as  the  number 
zero.  Additionally,  blank  space  within  a  matrix  usually  means  that  all  elements  in 
that  region  are  zero. 

4.  Special  Forms 

a.  Axis  Vectors 

An axis vector, $e_i$, is simply the $i$th column (or row) of the identity
matrix.
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b.  Lower  Triangular 

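A lower triangular matrix, usually denoted $L$, has the form

\[ L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \end{bmatrix} \tag{A.12} \]

If $L$ has ones on the diagonal, it is called unit lower triangular. Similarly, the upper
triangular matrix $U$ has the form

\[ U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \end{bmatrix} \tag{A.13} \]

$U$ is called unit upper triangular if the diagonal elements are all ones. Sometimes
(e.g., Chapter III) such a matrix is called right triangular and denoted $R$. When the
matrix is not square, the lower and upper triangular ideas are translated to lower and
upper trapezoidal, with the unit trapezoidal matrices having ones on the diagonal.
The following matrices illustrate the different kinds of trapezoidal matrices. The
matrices may be tall and skinny as

\[ U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \\ & & \\ & & \end{bmatrix} \qquad L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix} \tag{A.14} \]

or short and fat

\[ U = \begin{bmatrix} \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \end{bmatrix} \qquad L = \begin{bmatrix} \times & & & & \\ \times & \times & & & \\ \times & \times & \times & & \end{bmatrix} \tag{A.15} \]

D. NORMS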


The information below was taken from [Ref. 21: pp. 53-60], so it seems fitting
to begin with a few of Golub and Van Loan’s comments on norms.

Norms serve the same purpose on vector spaces that absolute value does on
the real line: they furnish a measure of distance. More precisely, $\Re^n$ together with
a norm on $\Re^n$ defines a metric space. Therefore, we have the familiar notions
of neighborhood, open sets, convergence, and continuity when working with vectors
and vector-valued functions.

1.  Vector  Norms 

a.  Definition 

A vector norm on $\Re^n$ is a function $f : \Re^n \to \Re$ that satisfies the following
properties [Ref. 21: p. 53]:

\[ f(x) \ge 0 \quad x \in \Re^n, \quad (f(x) = 0 \text{ iff } x = 0) \tag{A.16} \]

\[ f(x + y) \le f(x) + f(y) \quad x, y \in \Re^n \tag{A.17} \]

\[ f(\alpha x) = |\alpha| f(x) \quad \alpha \in \Re,\ x \in \Re^n \tag{A.18} \]

We denote such a function with a double bar notation: $f(x) = \|x\|$.

b.  The  p-Norm 

Subscripts on the double bar are used to distinguish between various
norms. The most popular example of this is the $p$-norm, $\|\cdot\|_p$. This norm is
defined by [Ref. 21: p. 53]

\[ \|x\|_p = (|x_1|^p + \cdots + |x_n|^p)^{1/p}, \quad p \ge 1. \tag{A.19} \]

The 2-norm is the one used most frequently in this work, but the 1- and $\infty$-norms
find frequent application in other work. A natural representation of the 2-norm is
the square root of an inner product

\[ \|x\|_2 = (|x_1|^2 + \cdots + |x_n|^2)^{1/2} = (x^T x)^{1/2}. \tag{A.20} \]

The 2-norm of $x$ is the Euclidean length of the vector $x$.
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2.  Matrix  Norms 


a.  Definition 

A matrix norm on $\Re^{m \times n}$ is a function $f : \Re^{m \times n} \to \Re$ that satisfies
properties similar to those presented in the vector case [Ref. 21: p. 56]:

\[ f(A) \ge 0 \quad A \in \Re^{m \times n}, \quad (f(A) = 0 \text{ iff } A = 0) \tag{A.21} \]

\[ f(A + B) \le f(A) + f(B) \quad A, B \in \Re^{m \times n} \tag{A.22} \]

\[ f(\alpha A) = |\alpha| f(A) \quad \alpha \in \Re,\ A \in \Re^{m \times n} \tag{A.23} \]

Matrix norms also use the double bar notation: $f(A) = \|A\|$. The Frobenius norm
and the $p$-norm are the most common matrix norms.


b.  Frobenius 


The Frobenius norm is defined as

\[ \|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2} \tag{A.24} \]


c.  p-Norms 


The $p$-norm of a matrix, $A$, is defined by

\[ \|A\|_p = \sup_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_p}. \tag{A.25} \]


E.  LINEAR  SYSTEMS 

One  of  the  fundamental  tasks  of  linear  algebra  is  to  form  a  matrix  representation 
of  a  system  of  linear  equations.  Consider  the  system  of  linear  equations: 

\[ \begin{aligned} 2u_1 + 3u_2 - 4u_3 &= 7 \\ 3u_1 - 5u_2 + 7u_3 &= 3 \\ 4u_1 + 6u_2 - 2u_3 &= 1 \end{aligned} \tag{A.26} \]

This system of equations can be expressed using the matrix notation $Au = b$:

\[ Au = \begin{bmatrix} 2 & 3 & -4 \\ 3 & -5 & 7 \\ 4 & 6 & -2 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 7 \\ 3 \\ 1 \end{bmatrix} = b \tag{A.27} \]
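
Once the system is in matrix form it can be handed to a solver. The following is a minimal sequential sketch of Gauss elimination with partial pivoting for this 3 x 3 example. It is not the thesis's parallel code; the macro DIM and the function gauss_solve() are hypothetical names.

```c
#include <math.h>

#define DIM 3

/* Solve A u = b in place by Gauss elimination with partial pivoting.
   On return b holds u. Returns 0 on success, -1 if a zero pivot is met. */
int gauss_solve(double a[DIM][DIM], double b[DIM])
{
    for (int k = 0; k < DIM; k++) {
        /* Partial pivoting: find the largest magnitude entry in column k. */
        int piv = k;
        for (int i = k + 1; i < DIM; i++)
            if (fabs(a[i][k]) > fabs(a[piv][k]))
                piv = i;
        if (a[piv][k] == 0.0)
            return -1;
        if (piv != k) {
            for (int j = 0; j < DIM; j++) {
                double t = a[k][j];
                a[k][j] = a[piv][j];
                a[piv][j] = t;
            }
            double t = b[k];
            b[k] = b[piv];
            b[piv] = t;
        }
        /* Eliminate the entries of column k below the pivot. */
        for (int i = k + 1; i < DIM; i++) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j < DIM; j++)
                a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution on the resulting upper triangular system. */
    for (int i = DIM - 1; i >= 0; i--) {
        for (int j = i + 1; j < DIM; j++)
            b[i] -= a[i][j] * b[j];
        b[i] /= a[i][i];
    }
    return 0;
}
```

Applied to the system (A.26), the routine returns $u_1 = 277/114$, $u_2 = -124/57$, and $u_3 = -13/6$, which can be checked by substitution into the three equations.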

F.  MEASURES  OF  COMPLEXITY 

The  first,  and  most  rudimentary  requirement  for  an  algorithm  is  that  it  produce 
the  correct  answer.  This  seems  utterly  obvious,  but  it  must  never  be  lost  in  the 
algorithm  designer’s  pursuit  of  the  next  most  important  elements — efficiency  in  using 
time  and  space.  For  the  moment,  we  shall  assume  that  the  algorithm  arrives  at  an 
acceptable  answer.  Then  the  algorithm’s  use  of  time  and  space  becomes  a  very 
serious  subject.  Knuth  provides  the  notation  in  [Ref.  34]. 

The time complexity of an algorithm, also known as running time, describes how
the  program  works  under  a  stopwatch.  Space  complexity  is  the  amount  of  temporary 
storage  required  to  carry  out  the  algorithm.  For  example,  suppose  a  person  stood  at 
a  chalkboard,  ready  to  solve  a  problem.  We  would  not  regard  the  input  or  output 
storage  space,  but  only  the  required  space  on  the  chalkboard,  in  the  space  complexity 
of  the  problem.  Usually  we  like  to  link  the  idea  of  complexity  to  the  input  size  of  the 
problem,  n.  The  following  discussion  of  time  complexity  outlines  a  few  tools  that 
are  standard  in  the  study  of  algorithms.  The  same  tools  and  ideas  apply  for  space 
complexity  analysis.  [Ref.  35:  pp.  42-43] 

The most common method for describing the time complexity of an algorithm
is the “big-Oh” notation [Ref. 35: p. 39].* A function $g(n)$ is $O(f(n))$ if there exist
constants $c$ and $N$ so that, for all $n \ge N$, $g(n) \le cf(n)$.

\[ g(n) = O(f(n)) \iff g(n) \le cf(n), \quad n \ge N \tag{A.28} \]

*$O(f(n))$ is read “order $f(n)$.”

This means that for a large enough problem size $n$, the running time $g(n)$ is
bounded above by a constant multiple of some function, $f(n)$. Big-Oh notation does not mean a least
upper bound, only an upper bound for $n$ sufficiently large. Practically, $O(f(n))$ must
be augmented so that we may determine how tightly $cf(n)$ bounds $g(n)$.

By adding a lower bound to big-Oh, we may arrive at a more informative
statement concerning an algorithm’s complexity. This is achieved through the use of
“big Omega”. $T(n) = \Omega(g(n))$ means that there exist constants $c$ and $N$ such that,
for all $n \ge N$, the number of steps $T(n)$ required to solve the problem for input size
$n$ is at least $cg(n)$.

\[ T(n) = \Omega(g(n)) \iff T(n) \ge cg(n), \quad n \ge N \tag{A.29} \]

This is essentially a lower bound on time complexity. If a function, $f(n)$, satisfies
both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$ (not necessarily using the same constants
$c$ and $N$ for both $O$ and $\Omega$) then we say that $f(n) = \Theta(g(n))$. [Ref. 35: p. 41]

\[ f(n) = O(g(n)) = \Omega(g(n)) \iff f(n) = \Theta(g(n)), \quad n \ge N \tag{A.30} \]

Now and then, notation similar to $O$ and $\Omega$ is required except that a strict inequality
is desired. In this case, we use “little oh” and “little omega”. The definitions are:

\[ f(n) = o(g(n)) \iff \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0 \iff g(n) = \omega(f(n)) \tag{A.31} \]

We have seen that $O$, $\Omega$, $\Theta$, $o$, and $\omega$ are roughly equivalent to the inequalities
$\le$, $\ge$, $=$, $<$, and $>$, respectively. Is this notation meaningful? Does it have utility in
problem solving? The answer is a guarded “yes.” We must understand the purpose
of the notation. It cannot substitute for timing data taken from the actual execution
of an algorithm. It is intended as a good first estimate. There are too many variables
involved in modern tools and machinery to expect accurate analysis from other than
actual execution.
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TABLE  A.l:  ALGORITHM  COMPLEXITY  AND  MACHINE  SPEED 


Algorithm          Execution Time (in Seconds) for Machine Speed
Complexity    1000 steps/sec   2000 steps/sec   4000 steps/sec   8000 steps/sec
log2 n                  0.01            0.005      [illegible]            0.001
n                          1              0.5      [illegible]            0.125
n log2 n                  10                5              2.5             1.25
n^1.5                     32               16                8                4
n^2                    1,000              500              250              125
n^3              [illegible]          500,000          250,000          125,000
1.1^n                  10^39            10^39   10^[illegible]   10^[illegible]

Nevertheless, a rough estimate of how a problem grows is important to the problem
solving process. Indeed, experimental results and complexity analysis should not
usually be considered independently, but compared and used as complementary
instruments. The time complexity of an algorithm is, in a sense, more important than
the speed of the machine upon which it is executed. Consider the data in Table A.1
(adapted from [Ref. 35: p. 41]). This is based upon a problem of size n = 1000 and
demonstrates the ability of an algorithm to dominate a machine. For this reason,
and with these conditions clearly established, we will find many occasions to use
time- and space-complexity notation.

Finally, the two most common performance measures for parallel computing
are speedup and efficiency. Suppose that $T_n$ is the time of execution for a particular
algorithm, $A$, on $n$ processors. Consider the best uniprocessor time $T_1$ for a sequential
version of $A$ compared to the execution of an equivalent (not necessarily the same)
parallel program on $P$ processors that executes in time $T_P$. Then speedup, $S_P$, is
defined as

\[ S_P = \frac{T_1}{T_P} \]

and the efficiency, $E_P$, is defined to be

\[ E_P = \frac{S_P}{P}. \]
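
Both measures reduce to one-line computations. The sketch below assumes the usual definitions, speedup as $T_1/T_P$ and efficiency as speedup divided by $P$, which are consistent with the tabulated data in Chapter VI; the function names are hypothetical. A hypercube of order $d$ is taken to have $2^d$ processors.

```c
/* Speedup S_P = T_1 / T_P for uniprocessor time t1 and P-processor time tp. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency E_P = S_P / P as a fraction; multiply by 100 for percent. */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / p;
}
```

As a check against Table 6.10, the n = 16 row gives $T_1 = 0.1126$ seconds and 0.1092 seconds on the order 4 cube (16 processors). The speedup is about 1.031 and the efficiency about 6.445 percent, in agreement with Tables 6.11 and 6.12.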
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APPENDIX  B 
EQUIPMENT 


A  transputer  is  a  microcomputer  with  its  own  local  memory  and  with  links 
for  connecting  one  transputer  to  another  transputer. 

The transputer architecture defines a family of programmable VLSI components.
The definition of the architecture falls naturally into the logical aspects
which define how a system of interconnected transputers is designed and programmed,
and the physical aspects which define how transputers, as VLSI components,
are interconnected and controlled.

A typical member of the transputer product family is a single chip containing
processor, memory, and communication links which provide point to point connection
between transputers. In addition, each transputer product contains special
circuitry and interfaces adapting it to a particular use. For example, a peripheral
control transputer, such as a graphics or disk controller, has interfaces tailored to
the requirements of a specific device.

A  transputer  can  be  used  in  a  single  processor  system  or  in  networks  to  build 
high  performance  concurrent  systems.  A  network  of  transputers  and  peripheral 
controllers  is  easily  constructed  using  point-to-point  communication. 

—  INMOS 

This  introduction  is  provided  by  the  transputer’s  maker  in  [Ref.  36:  p.  7]. 

A.  TRANSPUTER  MODULES 


INMOS  makes  a  wide  variety  of  microprocessors  to  suit  differing  needs.  To 
provide  a  simple,  modular  interface  they  have  developed  the  notion  of  a  transputer 
module  (TRAM).  The  TRAM  is  a  small  board  containing  the  microprocessor,  RAM, 
other  circuitry,  and  a  standard  sixteen  signal  interface. 

B.  THE  IMS  B012 


Most of the later experiments were carried out on an IMS B012 board. This
board accommodates sixteen transputers, each of which is installed on its own IMS
B401 TRAM. In our case the TRAM holds 32 kilobytes of memory (in addition to
the four kilobytes onboard the T800-20 transputer).

d.  INMOS  Transputers 

The  INMOS  transputer  gives  the  system  designer  a  tremendous  amount 
of  latitude.  With  these  processors — perhaps  more  than  with  any  other  parallel 
architecture — one  should  give  careful  thought  to  the  size,  component  processors,  and 
interconnection  topology  as  the  first  elements  in  designing  a  solution  to  a  problem. 
This cannot be overemphasized. When the hardware is not “general purpose” in nature,
it must receive thoughtful consideration along the path to solving the problem.
Some  of  the  largest  applications  for  parallel  machines — especially  for  transputers — 
are  embedded  systems. 

An embedded computer system is defined as “one that forms a part of
a larger system whose purpose is not primarily computational.” [Ref. 37: pp. 15-16]
To  automatically  accept  or  assume  a  particular  machine  configuration  is  to  relinquish 
control  of  one  of  the  tools  available  in  system  design. 

Transputer is the name given to the members of a family of microprocessors.
While INMOS is the largest producer of these processors, they have not chosen
to protect the name transputer with any sort of trademark. The name comes from
a combination of “transistor computer,” and each transputer is essentially a
computer on a chip. The chip possesses an arithmetic logic unit (ALU), memory, and a
communication system that supports bidirectional serial communication links. Most
of the transputers used for this research also include a 64-bit (IEEE 754 standard)
floating-point unit (FPU).

The transputer module (TRAM) is the most common package for transputers.
The capabilities of these modules are quite diverse, but they hold to a
standard interface design. This makes the TRAM easy to use. Systems designed
around TRAMs enjoy simple replacement of components, ease of modification, and
great scalability. Indeed, the laboratory environment in which these TRAMs were
exercised is a very dynamic one.

The  PARCDS  laboratory  has  six  80286-based  IBM-compatible  personal 
computers,  each  of  which  contains  a  transputer  interface  board.  Five  hold  IMS  B004 
boards  and  one  holds  a  Transtech  TMB08  board.  The  B004  boards  each  have  two 
megabytes  of  memory  and  an  IMS  T414  transputer  in  addition  to  the  requisite 
serial-to-parallel  converter  and  interface  circuits.  The  TMB08  holds  four  megabytes 
of  memory  and  an  IMS  T800-20  transputer.  These  “host”  machines  can  each  be 
connected  to  an  arbitrarily  large  network  of  transputers. 

For this purpose, we have two INMOS Transputer Evaluation Module
(ITEM) boxes. These boxes can hold at least ten boards of the Double Eurocard size
(approximately 22 cm x 23.5 cm). Of primary interest for this thesis was the IMS
B012 board, a motherboard capable of supporting sixteen TRAMs. For this research,
all sixteen slots were filled with a TRAM that held an IMS T800-20 transputer and
32 kilobytes of TRAM memory (in addition to the transputer’s four kilobytes). The
shortage of memory is probably the greatest deficiency and indicator of the outdated
nature of these processors. TRAMs with four and eight megabytes of memory and
IMS T805-25 transputers are currently available for less than $900.00 and $1,300.00
respectively.


e.  Intel  iPSC/2

The iPSC/2 used for this research contained eight node processors of
the “CX” type (80386/80387 combination). Like the transputers, this machine is
somewhat dated. Today’s i860 chips have far more capacity.

C.  SWITCHING  METHODS


The iPSC/2 and transputer hardware use different switching methods. Intel
uses a circuit switching approach, whereas the INMOS approach is store-and-forward
switching. Each approach has advantages and disadvantages. The circuit switching
approach is “almost universally used for telephone networks.” [Ref. 38: p. 12] The
idea is to first define a path (close a circuit) from the source to the destination and
then use it as a dedicated line.

This  requires  a  start-up  overhead  that  depends  entirely  upon  the  current  load 
being  handled  by  the  system.  If  any  part  of  the  medium  (links  or  switches)  between 
the  source  and  destination  is  busy,  the  message  will  wait  at  the  source  until  the 
entire  path  is  clear.  The  path  is  determined  (in  the  iPSC/2  case)  in  a  deterministic 
fashion,  so  that  a  message  from  node  i  to  node  j  will  always  insist  on  a  particular 
path,  even  if  some  other  communication  is  blocking  that  path.  As  the  path  becomes 
clear,  switches  between  the  source  and  destination  are  set  so  that  a  dedicated  line 
will  exist  from  source  to  destination. 

After the overhead of establishing (closing) the circuit has been paid, communication
proceeds at a rapid rate. The intermediate nodes along the path do not
store  the  message.  Instead,  their  switches  have  been  set  so  that  the  message  flows 
through.  Intuitively,  this  approach  should  be  quite  effective  in  a  network  with  a  very 
structured  interconnection  topology  and  a  relatively  small  number  of  nodes.  The 
hypercube  gives  us  this  structure.  Hypercubes  of  order  three  or  four  are  probably 
small  enough  to  avoid  difficulties  that  might  arise  as  many  nodes  contend  for  the 
same  medium. 

The store-and-forward approach does not require the availability of the entire
path between source and destination nodes. Instead, each node along the path accepts
the entire message in turn and then forwards it to the next node in the path.
This requires the use of no more than one link at a time. For a many-node environment
(particularly if there is little structure or the potential of dynamic routing), this
approach would seem to offer some advantages over the circuit switching approach.

The routing criterion is separate from the type of switching used. Either of
the two general approaches described above can support many forms of routing.
Deterministic approaches alone include many methods. For the hypercube topology
with Gray-coded node labels, it is probably useful to combine the Gray code with
the notion of Hamming distance to arrive at a shortest path route. Even with this
approach, there are as many optimum paths between two nodes i and j as the
Hamming distance between them [Ref. 39: p. 7]. If a dynamic scheme
is used to determine the path, there are even more combinations of potential paths
from i to j. Usually a dynamic approach considers media utilization, “hot spot”
avoidance, and so on.

APPENDIX  C

INTERCONNECTION  TOPOLOGIES 


Multiprocessor computing brings with it a fundamental concern: interprocessor
communication. Communication is, to any designer of computing machinery
or software, a burden and hindrance. An interconnection topology describes the
network that handles this load. The hypercube is one of the many topologies used
in multiprocessor computing. It has been the subject of both hype and criticism.
Nevertheless, this particular scheme possesses the qualities that quickly draw the
attention of mathematicians and parallel programmers. The hypercube’s structure
and simplicity make it dependable and predictable. The same properties that
enable the hypercube to endure the rigor of mathematical proof lead to practical
solutions in parallel programming. This discussion describes the hypercube
topology and explores some of the qualities that make it a practical choice for
multiprocessor computing.

A.  A  FAMILIAR  SETTING 


Organizing processors into a suitable topology is analogous to the familiar problem
of organizing personnel into groups. An independent worker has limited capacity,
so we often set more hands (or machinery) to the task for productivity’s sake. Groups
of people are often less efficient. Efficiency is a ratio of time spent doing useful work
to the total time spent. Other metrics might work, but time is universally recognized
as the standard against which productivity is measured. Dependence upon
others requires communication and consumes time. The loss may be minimized,
but not avoided. Any group working toward a common goal must deal with
this problem. To be efficient, an organization must possess structure and media for
communication.

People spend time on meetings, paperwork, and peripheral pursuits, all for
the sake of an organization that hopes to outperform the individual. Organizations
typically perform tasks that are simply impossible for an individual. To be sure, an
individual often possesses the independence and efficiency that makes him the proper
choice. There are tasks that seem to fit one or the other and, while there is some
crossover in ability, we aren’t likely to get rid of either organizations or individual
workers soon! This is worth considerable attention. Individuals and organizations
are chosen for different tasks.

These ideas apply in the world of parallel processing. First, there are many
tasks.  Some  fit  nicely  onto  a  single  processor.  Others  beg  a  parallel  solution.  Finally, 
some  have  natural  solutions  by  either  method.  Even  when  one  of  these  options  is 
selected,  there  are  many  ways  to  solve  the  problem.  If  a  multiprocessor  is  used  to 
solve  the  problem,  the  issue  of  communications  will  be  unavoidable. 

An interconnection topology must carry the burden of interprocessor communications.
There are many schemes for handling this mission. This discussion focuses
on one design that fulfills that mission: the hypercube. To forestall confusion: the
subject is an interconnection topology, not a particular vendor’s product.

B.  APPEAL  TO  INTUITION 

Productivity  can  suffer  when  the  members  of  an  organization  communicate 
excessively. A lack of communication can also reduce efficiency. In a network of
processors, lines of communication (links) are literal. The system will not be flexible
if  there  is  a  shortage  of  links,  but  with  too  many  links  a  message  could  get  delayed 
or  lost  in  the  confusion.  The  hypercube  attempts  to  strike  a  balance. 

Hypercubes come in different sizes. In fact, scalability is a key characteristic of
the hypercube. It allows the designer to tailor a network to a problem. There are
several ways to express the cube’s size; order is one measure. The term “hypercube
of order n” (usually called an n-cube) is filled with meaning. A more detailed
description is given later, but pictures provide the most direct introduction. Figure C.1
shows hypercubes of order n where n ∈ {0, 1, 2, 3}.

Figure C.1: The Four Smallest Hypercubes


This  illustration  is  important.  The  hypercube  shows  geometry,  structure,  and 
symmetry.  A  few  observations  nearly  jump  out  of  the  pictures.  One  can  see  several 
terms  of  a  geometric  series  developing.  There  is  also  a  recurrence  relation  at  work 
in  the  building  of  hypercubes.  Intuition  suggests  the  use  of  well-oiled  mathematical 
tools  to  analyze  the  hypercube. 

C.  TOOLS 

Many benefits may be derived from a few definitions, conventions, and tools
(that suit the hypercube’s structure). Figure C.2 demonstrates the utility of Cartesian
coordinates in n-dimensional space.

Figure C.2: Cartesian Coordinates for a 3-Cube

The picture is deceptively simple, but worth careful study. Figure C.2 shows a
unit cube in three dimensions. The vertex labels express (xyz) position in the
coordinate system. The labels also form a binary (Gray) code that is somehow equivalent
to coordinate labeling of a cube in n-dimensional space. The issue of communications
invoked this discussion, so distance must be addressed. A comparison of the
binary labels of any two nodes reveals that the distance between the nodes is equal to
the number of bits that differ in the labels. This measure, called Hamming distance,
and the Gray code are presented in more detail later.

This brief introduction is just enough to embark upon a more precise description
of the hypercube. The ideas of a coordinate system, node labeling, and distance
are fundamental. Graph theory also finds application in topology design. In the
hypercube these four tools complement each other nicely. Despite their simplicity they
can be explored in almost endless detail, even within the constraints of hypercube
structure.




D.  DESCRIBING  THE  HYPERCUBE 


The  hypercube  interconnection  topology  cannot  be  captured  in  a  one-sentence 
definition.  A  definition  is  often  inappropriate  for  material  objects.  A  description 
given  from  several  perspectives  may  be  more  useful.  This  is  the  case  with  topologies. 
Each  tool  introduced  above  has  its  own  utility.  In  a  sense,  each  takes  up  a  particular 
perspective.  A  meaningful  characterization  of  the  hypercube  can  be  achieved  by 
combining  these  perspectives. 

The geometric view is most useful for visualizing the cubes. Despite its tendency
to break down (with three-dimensional limitations), geometry’s intuitive appeal
is indispensable. Geometry and pictures lay the foundation for the setting of
an undirected graph. Figures C.1 and C.2 take advantage of geometry, but
three-dimensional sketches begin to lose their appeal as order increases. Nevertheless,
geometry and visual models hold an important place in describing the hypercube.
They furnish us with (a) examples for comparison, and (b) expectations that are
useful in the transition to a more general description of the topology.

A hypercube of order n may be described as a set of $2^n$ points (vertices, nodes,
or processors) connected by a set of edges. The points are each given an n-bit
binary label, $b_n \ldots b_3 b_2 b_1$. Thus the hypercube’s node labels exhaust all possible
n-bit binary combinations. Furthermore, the labeling convention used in Figure C.2
describes the point’s n-dimensional Cartesian coordinates.

The hypercube edge set (communication links) includes an edge between every
pair of points $p_i$ and $p_j$ whose binary labels differ in exactly one bit position, say $b_k$.
That is, adjacent nodes have a Hamming distance of one. This measure of distance
proves especially convenient in the hypercube, and it can be thought of in several
equivalent ways. A first definition of Hamming distance is the number of bits that
differ in the two labels. Equivalently, it is the number of 1’s in a bitwise exclusive
or (XOR) of the numbers. Figure C.2 contains an example. Let $p_i$ be the point
labeled 100 and $p_j$ be 110. The binary labels differ in exactly one bit position,
namely $b_2$ (the second bit). The points are neighbors (one hop from each other in
communications terms). [Ref. 40]

Despite the appeal of the geometric approach, it holds limited value in a general
n-dimensional space. Consider n = 4 in three dimensions. Typical illustrations
show the sixteen-node cube as a cube inside a cube with connections between
corresponding nodes of the inner and outer cubes. An equivalent diagram would display
two 3-cubes side-by-side with connections to corresponding nodes. Nevertheless, it
seems that an n-dimensional coordinate system is the most convenient environment
for sketching the hypercube of order n.

E.  GREATER  DIMENSIONS 

Three-dimensional  sketches  become  difficult  to  manage.  The  time  comes  for  a 
change  of  method.  Some  of  the  finest  tools  available  for  spanning  such  a  gap  are 
recurrence  relations  and  the  principle  of  mathematical  induction.  The  approach  is 
not  extremely  formal,  but  those  so  inclined  will  not  find  it  hard  to  add  the  formalities. 

Induction  can  be  used  to  generate  a  Gray  code  suitable  for  labeling  the  nodes 
of  a  hypercube.  This  code  and  the  Hamming  distance  can  be  used  to  determine 
the  cube.  The  first  topic  is  a  procedural  description  of  how  to  build  hypercubes.  A 
Gray  code  construction  procedure  will  follow.  If  the  two  topics  appear  similar,  it  is 
because  they  are  completely  equivalent  (assuming  that  the  Gray  code  is  combined 
with  the  concept  of  Hamming  distance). 

Constructing a hypercube of order zero is trivial. This is not important except
that it leads to greater things (i.e., it is the basis for induction). Second, suppose
that this hypothesis for induction is true: “we know how to construct any hypercube
of order k where $0 \le k < n$”. Induction forms a hypercube of order n using this
base case and hypothesis. This can be done in three steps:

•  Replicate the Hypercube of Order (n - 1) so that there are two identical copies.
For concreteness, one will be copy number 0 and the other will be copy number
1. The hypercubes have $2^{n-1}$ nodes each.

•  Prepend the copy number to the existing node labels. That is, place a leading 0
in front of the labels for each node of copy 0 and place a 1 in front of every node
label in copy 1. Now every node in one copy has a corresponding node in the
other copy. These corresponding nodes are separated by a Hamming distance
of one. That is, the last (n - 1) bits are the same for corresponding nodes and
they differ only in the prepended copy number.

•  Connect all nodes whose labels differ only in the prepended copy number. This
adds $2^{n-1}$ edges between the two copies.

F.  GRAY  CODE  GENERATION 

The procedure above generates hypercubes. By focusing on the vertex labels,
Gray code generation can be discussed. A Gray code is a cyclic list of all of the n-bit
numbers which changes in only one bit from one number to the next [Ref. 40]. Since
the code is binary, there are $2^n$ numbers in the list. The starting point is arbitrary
(it is cyclic) but I have started with zero. Perhaps the best explanation of Gray
codes comes in the construction of one. As in the construction of hypercubes, a base
case is required to begin generation.

•  Start with 0. This is a one-bit number (n = 1) so the one-bit Gray code must
have a total of $2^1 = 2$ numbers. The other is 1. Next, the hypercube building
steps established above are applied with slight modification.

•  Given the one-bit case, it is easy to generate the n = 2 code. Write down the
previous code and draw a line below it. Next, form a copy by reflecting the code
downward across the line. Place a zero in front of each number in the previous
code (above the line), and a one in front of each number in the new copy (below
the line).

•  This is a Gray code for n = 2. Table C.1 extends the idea. The list is cyclic,
each number consists of n bits, and the list contains all $2^n$ possible numbers. To
construct the code for larger n, the process may be applied repetitively. Copy
by reflecting the (n - 1)-bit code downward across a line, prepend a zero to
everything above the (most recent) line, and prepend a one to those below that
line.

TABLE C.1: GRAY CODE GENERATION

n = 1    n = 2    n = 3    n = 4

0        00       000      0000
1        01       001      0001
         11       011      0011
         10       010      0010
                  110      0110
                  111      0111
                  101      0101
                  100      0100
                           1100
                           1101
                           1111
                           1110
                           1010
                           1011
                           1001
                           1000


The Gray code is probably the most useful node labeling to attach to the hypercube.
This code often appears in implementation. The gray.c program listing
in Section H shows one way to generate the code. It can be used, for instance, as the
backbone of a routing function in a network. Labels with a Hamming distance of one
mark neighbors in the hypercube. What about the labels of two nodes that differ
in exactly k bits (i.e., have a Hamming distance of k)? It turns out that k is the
distance (number of edges) between these nodes. For all communications between
these nodes, the shortest path will involve k hops.

This  also  indicates  that,  for  an  n-cube,  there  is  no  pair  of  nodes  that  have 
a  Hamming  distance  of  more  than  n  (e.g.,  communication  between  nodes  0000010 
and  1111101  in  a  7-cube  can  be  achieved  in  seven  hops).  The  greatest  distance 
across  the  n-cube  is  n  hops.  In  fact,  for  each  node  in  a  hypercube,  there  is  a  unique 
corresponding  node  at  a  Hamming  distance  of  n.  Also,  there  are  n  nodes  at  a 
Hamming  distance  of  one  from  each  of  the  hypercube’s  nodes. 

Two  approaches  have  been  considered  so  far:  sketching  cubes  in  n-dimensional 
Cartesian  coordinates  and  studying  the  labels  associated  with  the  cubes.  Though 
the approaches are fundamentally different, they arrived at many of the same conclusions.
Careful application of the Gray code and Hamming distance could produce a
nearly  endless  string  of  results,  but  it  is  more  convenient  to  introduce  some  material 
from  the  study  of  graphs  at  this  point.  Graph  theory  combines  the  two  approaches: 
it  looks  at  the  pictures  and  studies  the  numbers  as  well.  The  small  hypercubes 
described  with  earlier  methods  are  given  graph  representation  in  the  illustration  of 
Figure  C.3. 

G.  GRAPHS  OF  HYPERCUBES 

Graph  theory  is,  of  course,  much  more  sophisticated  than  the  small  subset 
used  here.  Buckley  and  Harary  provide  a  valuable  source  [Ref.  41].  This  discussion 
exposes  a  few  salient  features  of  the  hypercube  from  the  perspective  of  graphs. 

Figure C.3: Hypercube Graphs

A graph, H, consists of a vertex set, V(H), and an edge set, E(H). The vertices,
or nodes, in the multiprocessor network model are the processors. The edges are the
links that connect the processors. I will avoid using the term order in its graph
theory sense (i.e., number of nodes) so that it cannot be confused with the order of
the hypercube. Consider the graph, $H_n$, of a hypercube of order n. The graph has
these characteristics:

•  There are $2^n$ nodes. This means that the number of nodes (i.e., processors)
grows very quickly with order.

•  Every vertex, v, in $H_n$ has eccentricity e(v) = n. Eccentricity is the distance
to a node farthest from v. Additionally, each node in a hypercube has exactly
one eccentric (farthest) node. This property means that hypercubes are unique
eccentric node (u.e.n.) graphs.

•  The radius of a graph is the minimum eccentricity of the nodes and diameter is
the maximum eccentricity. The hypercube is self-centered, meaning its radius
and diameter are the same: $r(H_n) = d(H_n) = n$. This is significant because it
says that worst-case communications distances only grow like the order of the
hypercube (that is, logarithmically in the number of nodes).

•  Connectivity is a measure of reliability or fault tolerance in multiprocessor
networks. The connectivity of a hypercube is equal to the order of the cube, n.
The edge connectivity is also n (each node has n incident edges).

Counting the number of nodes in a hypercube is easy. The hypercube construction
process also points to a recurrence relation that reveals the number of edges
in a hypercube. The initial case, of course, is the hypercube of order zero with no
edges. After this, the number of edges can be expressed in terms of the size of the
previous cube. Suppose a hypercube of order n has q edges. Then the hypercube of
order (n + 1) will have $2q + 2^n$ edges. This is because the construction procedure
calls for two copies and $2^n$ edges between them.

Figure C.4 provides an example. This is the graph, $H_4$, of the hypercube of
order four. All of the characteristics given above are evident. Additionally, a Gray
code labeling of the nodes is given. The recurrence relation above is useful, but it
retains a dependence upon q. A more convenient formula would depend on n alone.

In fact, there is a simple formula for the number of edges in the graph of a
hypercube, but it requires a closer look at the recurrence relation. In more formal
terms: let q(n) represent the number of edges in a hypercube of order n. Then:

    $$q(n) = \begin{cases} 0 & \text{if } n = 0 \\ 2q(n-1) + 2^{(n-1)} & \text{if } n \ge 1 \end{cases}$$

This can be expanded and shown equivalent to: $q(n) = n \, 2^{(n-1)}$. Table C.2
provides an example.

TABLE C.2: NODES AND EDGES FOR A HYPERCUBE

Order      Number of Nodes    Number of Edges

0          1                  0
1          2^1 = 2            2(0) + 2^0 = 1
2          2^2 = 4            2(1) + 2^1 = 4
3          2^3 = 8            2(4) + 2^2 = 12
4          2^4 = 16           2(12) + 2^3 = 32
5          2^5 = 32           2(32) + 2^4 = 80
6          2^6 = 64           2(80) + 2^5 = 192
7          2^7 = 128          2(192) + 2^6 = 448
(n - 1)    2^(n-1)            q
n          2^n                2q + 2^(n-1)

Figure C.4: Graph of a 4-Cube

H.  SOURCE  CODE  LISTINGS 

A  listing  of  the  Gray  code  generation  program  gray.c  follows. 
gray.c

/*
 -----========== PROGRAM INFORMATION ==========-----

 SOURCE      gray.c
 VERSION     1.2
 DATE        01 August 1991
 AUTHOR      Jon Hartman, U. S. Naval Postgraduate School
 USAGE       gray

 REFERENCES:

 [1] Hamming, Richard W. "Coding and Information Theory", 2nd edition,
     Englewood Cliffs, N.J.: Prentice-Hall, 1986, pp. 97-99.

 -----============== DESCRIPTION ==============-----

 This program generates and displays the Gray code described in [1].

 -----=============== ALGORITHM ===============-----

 Consider a b-bit Gray code beginning at zero.  Let j be an integral index
 such that 0 <= j < b.  Consider two b-vectors, mod_counter[] and bin[].
 Each element, mod_counter[j], holds a count mod (2^(j+1)).  Initially we
 shall set mod_counter[j] = (2^j).  Furthermore, let the elements of bin[]
 represent a binary number in the natural way.  That is, each element,
 bin[j] will be either 0 or 1, and bin[] will be formed so that the sum,
 ( 2^0 * bin[0] + 2^1 * bin[1] + 2^2 * bin[2] + ... ), represents the
 'value' of bin[].  We have elected to start the code at zero, so let
 bin[] be set to zeros initially.  Next perform this algorithm:

     for (i = 0; i < (2^b); i++) {
         Print the "binary number" represented by bin[].
         for (j = 0; j < b; j++) {
             Let mod_counter[j] = (mod_counter[j] + 1) mod (2^(j+1))
             If mod_counter[j] == 0, then toggle the bit in bin[j]
             (i.e., bin[j] = (bin[j] XOR 1) ).
         } end for(j)
     } end for(i)
*/

#include <stdio.h>
#include <stdlib.h>    /* calloc(), free(), exit(), EXIT_FAILURE */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE 1
#endif

#ifndef SUCCESS
#define SUCCESS 0
#endif

/* The shift is done on a long so that 2^n fits for every accepted n */
#define POW2(n) ((1L) << (n))

/* Largest b that is accepted; POW2(b) still fits in a signed long */
#define MAX_BITS ((long) (sizeof(long) * 8 - 2))

/* Discard the rest of the current input line.  This stands in for
 * fflush(stdin), whose behavior the C standard leaves undefined. */
static void discard_line(void)
{
    int c;

    while ((c = getchar()) != '\n' && c != EOF) { /* skip */ }
}

int main(void)
{
    int  patience = 5;     /* there's a limit to my patience! */
    long b = 0,            /* as in b-bit Gray code           */
         *bin,             /* as described above              */
         i, j,             /* generic integral values         */
         l = 0,            /* length of Gray code (2^b)       */
         *mod_counter;     /* as described above              */

    printf("\n\n\n\n\n\n-----==== ");
    printf("This program generates the binary numbers of a Gray code. ");
    printf("====-----\n\n\n");

    printf("  Successive numbers in a Gray code differ in exactly ");
    printf("one bit position.\n");

    printf("  The list generated by this program will be complete. ");
    printf("That is, if you\n");

    printf("  request the code of numbers that are b-bits long, ");
    printf("you will get a list\n");

    printf("  of (2^b) binary numbers, starting with zero.\n\n\n");

    /* The sole purpose of this while() loop is to get the value of b */
    while (b <= 0) {

        printf("  Please enter desired length (binary digits): ");
        if (scanf("%ld", &b) != 1) b = 0;    /* not a number: ask again */
        discard_line();
        printf("\n\n");

        if (b > 0) {    /* else ask again (patience permitting) */

            if (b > MAX_BITS) {    /* guard against too many left shifts! */
                printf("  The acceptable range is ");
                printf("1..%ld. ", MAX_BITS);
                printf("Please try again.\n\n\n");
                b = -1;
            } else {
                l = POW2(b);
            }
        }

        /* Only a failed attempt uses up patience */
        if ((b <= 0) && (--patience <= 0)) {
            printf("  Ran out of patience!\n");
            exit(EXIT_FAILURE);
        }
    }   /* end while (b <= 0) */

    /* Allocate storage for the arrays, test to see if it worked */
    bin         = (long *) calloc((size_t) b, sizeof(long));
    mod_counter = (long *) calloc((size_t) b, sizeof(long));

    if ((!bin) || (!mod_counter)) {
        printf("main(): Allocation failure bin[] or mod_counter[].\n");
        exit(EXIT_FAILURE);
    }

    /* Initialize mod_counter[] */
    for (i = 0; i < b; i++) mod_counter[i] = POW2(i);

    printf("  Gray code for %ld bits will generate ", b);
    printf("%ld numbers.\n\n\n", l);
    printf("  Press RETURN to continue... ");
    discard_line();
    printf("\n\n\n");

    /* Do the for() loop spoken of in the "ALGORITHM" section above */
    for (i = 0; i < l; i++) {

        /* Print the binary representation held in bin[] */
        printf("\t");
        for (j = (b - 1); j >= 0; j--) { printf("%ld", bin[j]); }
        printf("\n");

        /* Adjust the counters using addition mod (2^(j+1)) and toggle the
         * corresponding bit in bin[] whenever an element of mod_counter[]
         * reaches zero.
         */
        for (j = 0; j < b; j++) {
            mod_counter[j]++;
            if ((mod_counter[j] %= POW2(j + 1)) == 0) bin[j] ^= 1;
        }
    }   /* end for(i) */

    free(bin);
    free(mod_counter);

    return(SUCCESS);
}
/*-----============= EOF gray.c =============-----*/
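As a cross-check on the counter method described in the ALGORITHM comment, the sketch below generates the same code words in a plain array and compares them with the well-known closed form for the reflected binary Gray code, i XOR (i shifted right by one). The function names `gray_by_counters` and `gray_closed_form`, and the 16-bit limit, are assumptions made for this illustration; they do not appear in gray.c.

```c
#include <assert.h>

/* Fill code[0 .. 2^b - 1] with the b-bit Gray code produced by the
 * modular-counter method: bit j flips whenever a counter that starts
 * at 2^j and counts mod 2^(j+1) wraps around to zero.
 * The limit of 16 bits is an arbitrary choice for this sketch. */
void gray_by_counters(int b, unsigned long code[])
{
    unsigned long counter[16];
    unsigned long value = 0;    /* current code word; bit j plays bin[j] */
    unsigned long i, len;
    int j;

    assert(b >= 1 && b <= 16);
    len = 1UL << b;
    for (j = 0; j < b; j++) counter[j] = 1UL << j;

    for (i = 0; i < len; i++) {
        code[i] = value;
        for (j = 0; j < b; j++) {
            counter[j] = (counter[j] + 1) % (1UL << (j + 1));
            if (counter[j] == 0) value ^= 1UL << j;
        }
    }
}

/* Closed form for the reflected binary Gray code. */
unsigned long gray_closed_form(unsigned long i)
{
    return i ^ (i >> 1);
}
```

Calling `gray_by_counters(4, code)` with a 16-element array and comparing each `code[i]` against `gray_closed_form(i)` is one way to confirm that exactly one bit changes between successive entries.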
APPENDIX D


A  SPARSE  MATRIX 


Partial differential equations can be used to characterize many physical
problems. Explicit solutions to these problems are often quite complicated, so
alternative approaches warrant our attention. Simple matrices exist as
legitimate representatives of complex problems. A system of linear equations
can be constructed to give a discrete approximation to the problem. The
structure of the physical setting guarantees that the corresponding matrix of
coefficients will be sparse and symmetric. Why does this happen? When do we
have the right to expect such a simple matrix? Where does the matrix come from
and what does it mean?

This discussion explains how to construct the matrix of coefficients and
vectors that describe the numerical approximation to an elliptic partial
differential equation. Poisson’s equation in two dimensions is used to
demonstrate the process. The first step uses a finite difference approximation
to produce a system of equations. The system is fine-tuned and the matrix of
coefficients is extracted. The process reveals the origins of structure and
shows why the matrix is sparse and symmetric.

A.  LAPLACE  AND  POISSON 


To  most  engineers,  mathematicians,  and  scientists,  Laplace  and  Poisson  are 
familiar  French  names.  Pierre-Simon  de  Laplace  (1749-1827)  and  Simeon  Denis 
Poisson  (1781-1840)  made  sizeable  contributions  to  several  fields.  In  a  moment,  the 
discussion  turns  to  partial  differential  equations  named  in  honor  of  these  gentlemen. 

If the material seems a bit difficult, the following quote from [Ref. 42: p. 10]
may provide some encouragement. The ideas are not so obvious to everyone as they
may have been to Laplace.

Nathaniel Bowditch (1773-1838), an American astronomer and mathematician,
while translating Laplace’s Mécanique céleste in the early 1800s, stated, “I
never come across one of Laplace’s ‘Thus it plainly appears’ without feeling sure
that I have hours of hard work before me to fill up the chasm and find out and show
how it plainly appears.”

The next several pages are dedicated to showing how the matrix representation
of a partial differential equation plainly appears! The objective is to describe a
particular physical problem, then convert it to the equivalent matrix representation
using a deliberate, step-by-step approach.

B.  EQUATIONS 

Laplace and Poisson worked with partial differential equations that can be
observed in nature. What kinds of natural phenomena can be described with partial
differential equations? This section gives a brief answer to this question. The
discussion includes the natural setting, the equations, and a quick look at the
variables and constants involved. The link between the equations and their
physical meaning is critical, so this aspect must be developed. The heat equation
has one of the most intuitive physical interpretations available, so it is used as
a starting point. After developing a general perspective, the field can be narrowed
to a particular example: Poisson’s equation. Such a limited survey of partial
differential equations can only hope to succeed by appealing to the reader’s
experience and intuition.

1.  Heat 

Before looking at a partial differential equation, let us recall some plane
geometry. The intersection of a plane and a cone(s) provides many interesting shapes
and equations. Consider the equation that describes all points equidistant from a
point (focus) and a line (directrix):

$$y = \frac{1}{4c}\,x^2 + k \tag{D.1}$$

This is a parabola whose focus and vertex both lie on the y-axis (the axis of the
parabola is the y-axis). The focal length is $c$ and the vertex is located at $(0, k)$.

Partial differential equations are classified using conic sections much like
equations in the xy-plane. Introductions to partial differential equations often begin
with the heat equation:

$$\frac{\partial u}{\partial t} = \kappa \frac{\partial^2 u}{\partial x^2} + Q(x,t) \tag{D.2}$$

This is an example of a parabolic partial differential equation. Note the similarity of
equations (D.1) and (D.2).


a.  Definitions  and  Notation 


The heat equation describes the temperature, $u(x,t)$, in a “thin rod”
(the single dimension $x$ appears in the equation). The presence of $t$ indicates
dependence upon time. If there is a heat source (or sink) present, it is represented
by $Q$. We can see that $Q$ may be a function of $x$ or $t$ or both. When mass
density ($\rho$), specific heat ($s$), and thermal conductivity ($K$) are known, the
thermal diffusivity, $\kappa$, can be determined using the following relation:

$$\kappa = \frac{K}{s\rho} \tag{D.3}$$


b.  Houses  and  Heat 


From our youth, we have observed several important properties of heat
flow. The lessons are simple, few in number, and can be observed from the comfort
of our home. First, heat energy only flows when there is a difference in temperature.
If the temperature outside is the same as the indoor temperature, no heat energy will
cross the threshold (even with the door open). A temperature difference represents
an instability and heat will flow to counter this situation.

When heat does flow, it goes from hotter to colder regions. The loss of
heat energy from the warmer region reduces the temperature there, and the
temperature in the colder region rises as it gains heat energy. The transfer of heat
has a stabilizing effect (the environment will not be at rest as long as temperature
differences exist). We do not find the changes in temperature surprising, but our
conversation indicates confusion concerning the direction of the flow. Most of us have
heard someone say: “Close the door, you’re letting cold air in!” We understand that
this statement is not correct, but it seems to persist from one generation to the next.

In addition to the idea that heat flows in the presence of temperature
differences (gradients), we clearly understand that larger differences are related to
greater heat flow. On a very cold winter day, the parent notices more quickly that the
child left the door open (and displays more urgency in shutting it). In other words,
the effect of heat flow is to balance differences in temperature and it somehow “works
harder” when there is a greater difference to balance. In mathematical terms, we
would suspect (correctly) that heat flow is proportional to temperature difference.

Finally, we recognize an ability to restrict heat’s ever-present balancing
efforts. Sometimes we want an imbalance in temperature, and we often use insulation
to maintain this imbalance. When we shut the door, we expect that it will slow
the transfer of thermal energy through the doorway and enable us to maintain an
acceptable imbalance in temperature. For the same reason we use special materials
in the construction of refrigerators to keep heat out, and in ovens to keep heat energy
inside. This means that the effectiveness of heat transfer is subject to properties of
the medium (air, glass windows, fiberglass insulation, wood doors, steel, styrofoam,
and so on) through which it flows.

c.  Heat  Flux 

The right-hand side of the heat equation looks a bit complex, but it
merely captures this idea of heat flow. Before tackling the second partial derivative
of $u$ with respect to $x$, think about the first partial derivative. The first partial
derivative of $u$ with respect to $x$ (scaled by the thermal conductivity, $K$) describes
movement of thermal energy. This flow of heat is usually called heat flux, denoted
$\phi$, and can be calculated using Fourier’s law of heat conduction:

$$\phi = -K \frac{\partial u}{\partial x} \tag{D.4}$$

Heat flux is a measure of how much thermal energy per unit time is
moving to the right per unit surface area (by convention, flow to the left is assigned
a negative value and flow to the right is positive) [Ref. 43: p. 3]. The second partial
derivative measures changes in flux with respect to position. In other words, it
represents increasing or decreasing flux.
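The sign convention can be checked numerically. The following is a minimal sketch, assuming temperatures sampled at evenly spaced points along the rod and a centered difference for the first partial derivative; the function name `heat_flux` and its parameters are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

/* Approximate the heat flux phi = -K * du/dx at interior sample p,
 * using a centered difference on samples u[0 .. count-1] that are
 * spaced dx apart along the rod. */
double heat_flux(const double u[], int count, int p, double dx, double K)
{
    assert(p >= 1 && p < count - 1);
    assert(dx > 0.0);
    return -K * (u[p + 1] - u[p - 1]) / (2.0 * dx);
}
```

With temperatures that fall from left to right the result is positive, which matches the convention that flow to the right (from hotter to colder) is positive.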

d.  Heat  Equation  Summary 

Let us carefully reassemble the pieces of the heat equation (D.2) to see
if the theory agrees with experience. Temperature has spatial and temporal
dependencies. The left-hand side describes changes in temperature over time.
Changes in heat flux are captured in the second partial of $u$ that appears on the
right-hand side. Flux, heat energy in motion, acts to equalize temperature. The
thermal diffusivity, $\kappa$, measures how readily the material passes heat flux.
That is, a temperature difference activates the flow of heat but the speed and
effectiveness of this flow is moderated by material properties. Considering
everything, then, the heat equation can be stated in one (long) sentence: Changes in
temperature over time are caused by (equal to, due to, related to) changes in heat
flow (moderated or accelerated by properties of the material) and thermal source(s).
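The one-sentence reading of the heat equation can be turned into a single update step. The sketch below is only an illustration: it assumes a forward-difference step in time and a centered difference in space, a discretization that this appendix does not itself develop, and the name `heat_step` and its parameters are hypothetical. Such an explicit step is generally regarded as stable only when `kappa * dt / (dx * dx)` does not exceed one half.

```c
#include <assert.h>

/* Advance the interior temperatures by one time step of size dt:
 *   u_new[p] = u[p] + dt * (kappa * (u[p-1] - 2u[p] + u[p+1]) / dx^2 + Q[p])
 * The two end points are copied unchanged (fixed boundary temperatures). */
void heat_step(const double u[], const double Q[], double u_new[],
               int count, double kappa, double dx, double dt)
{
    int p;

    assert(count >= 3 && dx > 0.0 && dt > 0.0);
    u_new[0] = u[0];
    u_new[count - 1] = u[count - 1];
    for (p = 1; p < count - 1; p++) {
        /* change in flux with position, i.e. the second partial of u */
        double second = (u[p - 1] - 2.0 * u[p] + u[p + 1]) / (dx * dx);
        u_new[p] = u[p] + dt * (kappa * second + Q[p]);
    }
}
```

A hot spot between two cold neighbors cools, while a straight-line temperature profile with no source is left unchanged, since its second partial derivative is zero.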

2.  Notation 

With two or more dimensions, the same equations that looked simple in one
dimension can begin to look complex. The linear operator, $\Delta$, is used to simplify
the notation. For example, $\Delta u$, substituted into the right-hand side of (D.2), gives
the heat equation a new look:

$$\frac{\partial u}{\partial t} = \kappa \Delta u + Q(x,t) \tag{D.5}$$

This is a more general equation since the linear operator $\Delta$ can be applied in any
number of dimensions. For instance (in three dimensions),

$$\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \tag{D.6}$$

Sometimes this operator is called the Laplacian of $u$ and some authors use the del
operator, $\nabla$, in these equations ($\nabla^2 u = \Delta u$).
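To make the operator concrete, the sketch below estimates the three-dimensional Laplacian with centered differences. The sample field $u = x^2 + y^2 + z^2$ is an assumption chosen for illustration because its Laplacian is exactly 6 everywhere; the names `sample_u` and `laplacian3` are hypothetical.

```c
#include <assert.h>

/* Sample field (an assumption for illustration): u = x^2 + y^2 + z^2,
 * whose Laplacian is exactly 6 at every point. */
double sample_u(double x, double y, double z)
{
    return x * x + y * y + z * z;
}

/* Centered-difference estimate of u_xx + u_yy + u_zz at (x, y, z),
 * using the same step h in each direction. */
double laplacian3(double (*u)(double, double, double),
                  double x, double y, double z, double h)
{
    double center = u(x, y, z);

    assert(h > 0.0);
    return (u(x - h, y, z) - 2.0 * center + u(x + h, y, z)) / (h * h)
         + (u(x, y - h, z) - 2.0 * center + u(x, y + h, z)) / (h * h)
         + (u(x, y, z - h) - 2.0 * center + u(x, y, z + h)) / (h * h);
}
```

For a quadratic field the centered difference reproduces the second derivative without truncation error, so `laplacian3(sample_u, 1.0, 2.0, 3.0, 0.5)` should agree with 6 up to rounding.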


3.  Diffusion 


The behavior of thermal energy is actually a special instance of diffusion,
so (D.5) is often referred to as the diffusion equation. With an appropriate
substitution for $\kappa$, the equation might describe the spreading of dye through
ocean water. In an agricultural application, it could characterize water or chemical
penetration in soil. We shall continue to use the term “heat equation”, though, for
the sake of consistent terminology and notation.

4.  Laplace’s  Equation 


Consider the effect of a few restrictions on the heat equation. Suppose that
there is no source of thermal energy ($Q = 0$) and the physical properties of the
material do not vary ($\kappa$ is constant). Finally, what happens if the time-dependency
is removed?

The left-hand side of the equation goes away. This is not so unrealistic.
Systems may reach a steady (equilibrium) state after a time (especially in the absence
of sources). We can divide through by $\kappa$ (assuming $\kappa \neq 0$) and the equation becomes:

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \tag{D.7}$$

This is Laplace’s equation in the two dimensions $x$ and $y$. Sometimes it is called
the potential equation since it also describes the cases in which $u$ stands for
gravity or voltage. It can also describe “steady-state heat flow... hydrodynamics,
gravitational attraction, elasticity, and certain motions of incompressible fluids”
[Ref. 44: pp. 660-661].

5.  Ellipses 

Although Laplace’s equation seems like a steady-state heat equation, it is
fundamentally different. It falls in the elliptic class of partial differential equations.
Consider an ellipse centered at the origin with foci (on the x-axis at a distance of $c$
from the origin) located at $(-c, 0)$ and $(c, 0)$. Suppose that the foci are labeled $F_1$
and $F_2$. The major axis passes through the center and through the foci, connecting
two vertices positioned at $(-a, 0)$ and $(a, 0)$. The minor axis passes through the
center perpendicular to the major axis and connects the vertices at $(0, -b)$ and
$(0, b)$. The major axis deserves its name since $a > b$ (in the case of equality the
ellipse degenerates and we get a special case: the circle).

For any arbitrary point, $p$, let $d_1$ be the distance from $p$ to $F_1$
and let $d_2$ be the distance from $p$ to $F_2$. Furthermore, let $d = d_1 + d_2$. The ellipse
is described by all points satisfying $d = 2a$, where $a$ is the constant length of the
ellipse’s semi-major axis as described above. The standard form for the equation of
this ellipse is

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1 \tag{D.8}$$

Using the distances from this ellipse, a right triangle can be formed with sides of
length $b$ and $c$ and hypotenuse of length $a$. This means $a$, $b$, and $c$ are related by the
Pythagorean Theorem: $a^2 = b^2 + c^2$.
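A small worked example may help tie these relations together. The values $a = 5$, $b = 4$, $c = 3$ below are chosen purely for illustration.

```latex
% Illustrative values (an assumption): a = 5, b = 4, c = 3, foci at (-3,0) and (3,0).
\begin{align*}
  a^2 &= b^2 + c^2:            & 25 &= 16 + 9 \\
  \text{vertex } (0,4):\quad   & d_1 = d_2 = \sqrt{3^2 + 4^2} = 5, & d &= 10 = 2a \\
  \text{vertex } (5,0):\quad   & d_1 = 8,\ d_2 = 2,                & d &= 10 = 2a \\
  \text{standard form at } (0,4):\quad & \frac{0^2}{25} + \frac{4^2}{16} = 1
\end{align*}
```

The right triangle mentioned above has its corners at the center, a focus, and the vertex $(0, b)$, which is why the distance from that vertex to either focus equals $a$.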
Figure D.1: The Region


6.  Poisson’s  Equation 

We have discussed several partial differential equations and observed the
impact of changing a few parameters. Laplace’s equation showed what happens in
the steady-state case when sources are removed and the thermal diffusivity is
nonzero. Now we return to the more general problem that can be represented in the
presence of a source, sometimes called a driving (or forcing) function, say $f(x,y)$.

The result is Poisson’s equation (shown here in two dimensions):

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y) \tag{D.9}$$

Again, $u(x,y)$ typically represents temperature or voltage. Laplace’s equation (D.7)
is just the special case of Poisson’s equation (D.9) where $f(x,y) = 0$. The rest of
the discussion will focus on Poisson’s equation within the rectangular region (shown
in Figure D.1): $0 \le x \le L$, $0 \le y \le H$.
Figure D.2: Subdividing the Rectangle


7. Final Assumptions

We shall assume that the conditions along the boundaries are known and are
given by $u = g(x,y)$. The problem is solved in the presence of a forcing function $f$.
The goal is to produce something that a computing machine can “solve”. To reach
this position, several steps are required. First, the domain is divided into many
smaller regions. Using this subdivision scheme, a system of equations is developed.
The information that is known ($f$ and $g$) can be moved to the right-hand side of the
system. The system can then be represented in typical $Ax = b$ fashion.

C.  DISCRETIZATION 

Before  attempting  a  numerical  solution,  the  domain  must  be  subdivided  into  a 
finite  (but  probably  large)  number  of  elements.  Figure  D.2  provides  an  illustration 
of  what  this  mesh  looks  like.  We  should  not  forget  that  actual  applications  may 
involve  100  (or  more)  divisions  in  each  direction.  Nevertheless,  (artificially)  small 
examples are quite sufficient for conveying notation and measures within the region.

1.  Notation 

A  clear  understanding  of  the  problem  domain,  conventions,  and  notation 
is  prerequisite  to  developing  the  system  of  equations.  Consider  Figure  D.2.  This 
domain  will  serve  as  a  reference  for  the  upcoming  discussion  on  conventions  and 
notation. 

The rectangular region has length $L = 9$ and height $H = 5$. It has been
subdivided into 45 smaller elements by a mesh made of four horizontal lines and eight
vertical lines. The integers $m$ and $n$ are used to keep track of how many horizontal
and vertical dividing lines are used (here $m = 4$ and $n = 8$). Each element has length
$h$ (in the x-direction) and height $k$ (in the y-direction). In this particular example,
the elements are (conveniently) square with $h = k = 1$. In general, the individual
elements within the region are rectangular (it is not necessarily true that $h = k$).

The elements within the region are uniformly spaced (each has the same
size). $L$, $H$, $h$, and $k$ do not need to be integers; they can be any convenient units.
To guarantee uniform spacing, of course, $L$ and $H$ must be integer multiples of $h$
and $k$, respectively. That is:

$$L = (n + 1)h, \quad n \in \{0, 1, 2, 3, \ldots\}$$

$$H = (m + 1)k, \quad m \in \{0, 1, 2, 3, \ldots\}$$
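These two relations are easy to check against the example region. The sketch below, with the hypothetical helper names `mesh_spacing` and `element_count`, recovers the spacing from the extent of the region and the number of interior dividing lines, and counts the resulting elements.

```c
#include <assert.h>

/* Spacing between mesh lines when a side of the given extent (L or H)
 * is split by the given number of interior dividing lines (n or m). */
double mesh_spacing(double extent, int interior_lines)
{
    assert(extent > 0.0 && interior_lines >= 0);
    return extent / (interior_lines + 1);
}

/* Number of h-by-k elements in the subdivided rectangle. */
int element_count(int n, int m)
{
    assert(n >= 0 && m >= 0);
    return (n + 1) * (m + 1);
}
```

For the region described above, `mesh_spacing(9.0, 8)` and `mesh_spacing(5.0, 4)` both give 1, and `element_count(8, 4)` gives the 45 elements mentioned in the text.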

2.  Internal  Mesh  Points 

Our goal is a system of equations, and ultimately a problem stated in terms
of a matrix and vectors. We will eventually see that there are $mn$ equations in $mn$
unknowns, one for each internal mesh point (where the lines cross). Imagine elements
of size $h \times k$ (as before) that are centered on these points, such as the cross-hatched
element at (7,3). Each equation in the system will correspond to one of these
line-crossings and represent one of these elements. It is useful to label the lines for
reference purposes. To accomplish this, we use the (integer) counters $i$ and $j$.

These counters are used to reference particular vertical and horizontal
dividing lines. The $i$ counter refers to a vertical line ($1 \le i \le n$) and the horizontal
lines are indexed by $j$ ($1 \le j \le m$). Figure D.2 may be deceptively simple due to
the element dimensions $h = k = 1$. Because of this, $i = 7$ indicates an x-coordinate
of 7 and $j = 3$ means $y = 3$. But the counters $i$ and $j$ are not generally equivalent to
x- and y-position in the coordinate system. Given $h$, $k$, $i$, and $j$, the corresponding
coordinates are $(x,y) = (ih, jk)$.
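The distinction between counters and coordinates shows up as soon as the spacing is not 1. A minimal sketch follows; the `struct point` type and the function name `mesh_point` are hypothetical names used only here.

```c
#include <assert.h>

struct point { double x; double y; };

/* Coordinates of the mesh point on vertical line i and horizontal
 * line j, for element length h and element height k. */
struct point mesh_point(int i, int j, double h, double k)
{
    struct point p;

    assert(h > 0.0 && k > 0.0);
    p.x = i * h;
    p.y = j * k;
    return p;
}
```

With $h = k = 1$ the point $(i, j) = (7, 3)$ sits at $(7, 3)$, but with $h = 0.5$ and $k = 0.25$ the same counters refer to the position $(3.5, 0.75)$.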

D.  A  SYSTEM  OF  EQUATIONS 

The next step is to build a system of $mn$ equations that describes the problem.
First, we need to agree upon a referencing scheme for the internal mesh points. The
numbering will be based upon $i$ and $j$ as defined above. This numbering scheme
begins at the bottom left (i.e., $i = j = 1$), proceeds up the first column and then
moves, column-by-column, to the right. Specifically, the points will be assigned a
label

$$\ell = m(i - 1) + j \tag{D.10}$$

Given the values $i$ and $j$ for any internal point, now we can assign it a label
($1 \le \ell \le mn$). Figure D.3 shows values of $i$ along the x-axis, values of $j$
along the y-axis, and labeling of internal mesh points according to (D.10).
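The labeling rule (D.10) and its inverse can be sketched in a few lines. The function names `mesh_label` and `mesh_indices` are hypothetical; the inverse mapping is not stated in the text and is included here only as a convenience that follows from (D.10).

```c
#include <assert.h>

/* Label (D.10): l = m(i - 1) + j, for 1 <= i <= n and 1 <= j <= m. */
int mesh_label(int i, int j, int m)
{
    assert(m >= 1 && i >= 1 && j >= 1 && j <= m);
    return m * (i - 1) + j;
}

/* Inverse mapping: recover the counters (i, j) from a label. */
void mesh_indices(int label, int m, int *i, int *j)
{
    assert(m >= 1 && label >= 1);
    *i = (label - 1) / m + 1;
    *j = (label - 1) % m + 1;
}
```

With $m = 4$ and $n = 8$, the bottom-left point is labeled 1, the top of the first column is labeled 4, the next column starts at 5, and the last point $(8, 4)$ receives the label $mn = 32$.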

1.  Finite  Differences 

Figure D.3: Numbering the Equations

The approach calls for analyzing each internal mesh point. Figure D.4
shows the point referenced by $i$ and $j$ and its neighbors to the North, South, East,
and West. We use a centered finite difference method to approximate the partial
derivatives in (D.9) and arrive at the equations for these points. The finite difference
approximations for the partial derivatives are:

$$\left.\frac{\partial^2 u}{\partial x^2}\right|_{(i,j)} \approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} \tag{D.11}$$

$$\left.\frac{\partial^2 u}{\partial y^2}\right|_{(i,j)} \approx \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2} \tag{D.12}$$

The approximation for the partial derivative in the x-direction (D.11)
considers the neighbor to the West, the point itself, and the neighbor to the East.
Similarly, the approximation in the y-direction (D.12) recognizes neighbors to the
South and North in addition to the point. Both finite difference approximations
favor the center point, giving it twice the weight of its neighbors.

Substituting these into Poisson’s equation (D.9), and multiplying through by $-1$, yields:

$$-\left(\frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2}\right) - \left(\frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2}\right) \approx -f_{i,j} \tag{D.13}$$
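The left-hand side of (D.13) involves only a point and its four neighbors, so it can be sketched as a function of five values. The name `stencil_lhs` and its argument order are assumptions made for this illustration.

```c
#include <assert.h>

/* Left-hand side of (D.13) at one interior mesh point, given the value
 * at the point (center) and at its West, East, South and North
 * neighbors, for element length h and element height k. */
double stencil_lhs(double west, double east, double south, double north,
                   double center, double h, double k)
{
    assert(h > 0.0 && k > 0.0);
    return -(west - 2.0 * center + east) / (h * h)
           - (south - 2.0 * center + north) / (k * k);
}
```

As a check, take $u = x^2 + y^2$, for which Poisson’s equation holds with $f = 4$. At $(x, y) = (1, 2)$ with $h = 0.5$ and $k = 0.25$ the neighbors are 4.25 (West), 6.25 (East), 4.0625 (South) and 6.0625 (North) around the center value 5, and the function returns $-4 = -f$, in agreement with (D.13).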
Figure D.4: Neighbors to the North, South, East, and West

The forcing function, $f_{i,j}$, is known so (D.13) begins to look like one of many
equations in a linear system. There is such an equation for every internal mesh point.
To make sure that we consider all of the internal mesh points in an orderly fashion,
we may number them as in Figure D.3 and consider them one at a time.

2.  More  Equations 

At  this  point,  we  know  the  general  form  (D.13)  for  each  of  the  equations 
that  must  be  considered.  The  matrix  of  coefficients  may  not  be  completely  clear  yet, 
so  let  us  consider  each  of  the  equations  in  the  order  of  their  labels.  For  now,  we  will 
leave  the  i,j  subscripts  on  everything: 

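$$
\begin{aligned}
-\left(\frac{u_{0,1} - 2u_{1,1} + u_{2,1}}{h^2}\right) - \left(\frac{u_{1,0} - 2u_{1,1} + u_{1,2}}{k^2}\right) &\approx -f_{1,1} \\
-\left(\frac{u_{0,2} - 2u_{1,2} + u_{2,2}}{h^2}\right) - \left(\frac{u_{1,1} - 2u_{1,2} + u_{1,3}}{k^2}\right) &\approx -f_{1,2} \\
&\;\;\vdots \\
-\left(\frac{u_{0,m-1} - 2u_{1,m-1} + u_{2,m-1}}{h^2}\right) - \left(\frac{u_{1,m-2} - 2u_{1,m-1} + u_{1,m}}{k^2}\right) &\approx -f_{1,m-1} \\
-\left(\frac{u_{0,m} - 2u_{1,m} + u_{2,m}}{h^2}\right) - \left(\frac{u_{1,m-1} - 2u_{1,m} + u_{1,m+1}}{k^2}\right) &\approx -f_{1,m} \\
-\left(\frac{u_{1,1} - 2u_{2,1} + u_{3,1}}{h^2}\right) - \left(\frac{u_{2,0} - 2u_{2,1} + u_{2,2}}{k^2}\right) &\approx -f_{2,1} \\
-\left(\frac{u_{1,2} - 2u_{2,2} + u_{3,2}}{h^2}\right) - \left(\frac{u_{2,1} - 2u_{2,2} + u_{2,3}}{k^2}\right) &\approx -f_{2,2} \\
&\;\;\vdots \\
-\left(\frac{u_{1,m-1} - 2u_{2,m-1} + u_{3,m-1}}{h^2}\right) - \left(\frac{u_{2,m-2} - 2u_{2,m-1} + u_{2,m}}{k^2}\right) &\approx -f_{2,m-1} \\
-\left(\frac{u_{1,m} - 2u_{2,m} + u_{3,m}}{h^2}\right) - \left(\frac{u_{2,m-1} - 2u_{2,m} + u_{2,m+1}}{k^2}\right) &\approx -f_{2,m} \\
&\;\;\vdots \\
-\left(\frac{u_{n-2,1} - 2u_{n-1,1} + u_{n,1}}{h^2}\right) - \left(\frac{u_{n-1,0} - 2u_{n-1,1} + u_{n-1,2}}{k^2}\right) &\approx -f_{n-1,1} \\
-\left(\frac{u_{n-2,2} - 2u_{n-1,2} + u_{n,2}}{h^2}\right) - \left(\frac{u_{n-1,1} - 2u_{n-1,2} + u_{n-1,3}}{k^2}\right) &\approx -f_{n-1,2} \\
&\;\;\vdots \\
-\left(\frac{u_{n-2,m-1} - 2u_{n-1,m-1} + u_{n,m-1}}{h^2}\right) - \left(\frac{u_{n-1,m-2} - 2u_{n-1,m-1} + u_{n-1,m}}{k^2}\right) &\approx -f_{n-1,m-1} \\
-\left(\frac{u_{n-2,m} - 2u_{n-1,m} + u_{n,m}}{h^2}\right) - \left(\frac{u_{n-1,m-1} - 2u_{n-1,m} + u_{n-1,m+1}}{k^2}\right) &\approx -f_{n-1,m} \\
-\left(\frac{u_{n-1,1} - 2u_{n,1} + u_{n+1,1}}{h^2}\right) - \left(\frac{u_{n,0} - 2u_{n,1} + u_{n,2}}{k^2}\right) &\approx -f_{n,1} \\
-\left(\frac{u_{n-1,2} - 2u_{n,2} + u_{n+1,2}}{h^2}\right) - \left(\frac{u_{n,1} - 2u_{n,2} + u_{n,3}}{k^2}\right) &\approx -f_{n,2} \\
&\;\;\vdots \\
-\left(\frac{u_{n-1,m-1} - 2u_{n,m-1} + u_{n+1,m-1}}{h^2}\right) - \left(\frac{u_{n,m-2} - 2u_{n,m-1} + u_{n,m}}{k^2}\right) &\approx -f_{n,m-1} \\
-\left(\frac{u_{n-1,m} - 2u_{n,m} + u_{n+1,m}}{h^2}\right) - \left(\frac{u_{n,m-1} - 2u_{n,m} + u_{n,m+1}}{k^2}\right) &\approx -f_{n,m}
\end{aligned}
$$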

3.  Modification 


The goal is to determine $u_{i,j}$ for all internal points $(i, j)$. Having completed
several foundational steps, we can see a developing system of $mn$ equations. Let’s
clean it up a bit. To do this, we need to make better use of one more piece of the
given information: the boundary values. For those points just inside the boundaries
(a horizontal distance of $h$ from the sides and/or a vertical distance of $k$ from the
top or bottom) we already know part of the left side of (D.13). In particular, any
subscript $i = 0$, $j = 0$, $i = n + 1$, and/or $j = m + 1$ signifies a (known) boundary
point.

Multiplying through by $(hk)^2$ and moving the known information to the
right-hand side of the equations, we again start with the left-most column ($i = 1$)
and work in the order of the labels. Now the system of equations looks like this:
$$
\begin{aligned}
k^2(2u_{1,1} - u_{2,1}) + h^2(2u_{1,1} - u_{1,2}) &\approx -(hk)^2 f_{1,1} + k^2 u_{0,1} + h^2 u_{1,0} \\
k^2(2u_{1,2} - u_{2,2}) + h^2(-u_{1,1} + 2u_{1,2} - u_{1,3}) &\approx -(hk)^2 f_{1,2} + k^2 u_{0,2} \\
&\;\;\vdots \\
k^2(2u_{1,m-1} - u_{2,m-1}) + h^2(-u_{1,m-2} + 2u_{1,m-1} - u_{1,m}) &\approx -(hk)^2 f_{1,m-1} + k^2 u_{0,m-1} \\
k^2(2u_{1,m} - u_{2,m}) + h^2(-u_{1,m-1} + 2u_{1,m}) &\approx -(hk)^2 f_{1,m} + k^2 u_{0,m} + h^2 u_{1,m+1} \\
k^2(-u_{1,1} + 2u_{2,1} - u_{3,1}) + h^2(2u_{2,1} - u_{2,2}) &\approx -(hk)^2 f_{2,1} + h^2 u_{2,0} \\
k^2(-u_{1,2} + 2u_{2,2} - u_{3,2}) + h^2(-u_{2,1} + 2u_{2,2} - u_{2,3}) &\approx -(hk)^2 f_{2,2} \\
&\;\;\vdots \\
k^2(-u_{1,m-1} + 2u_{2,m-1} - u_{3,m-1}) + h^2(-u_{2,m-2} + 2u_{2,m-1} - u_{2,m}) &\approx -(hk)^2 f_{2,m-1} \\
k^2(-u_{1,m} + 2u_{2,m} - u_{3,m}) + h^2(-u_{2,m-1} + 2u_{2,m}) &\approx -(hk)^2 f_{2,m} + h^2 u_{2,m+1} \\
&\;\;\vdots \\
k^2(-u_{n-2,1} + 2u_{n-1,1} - u_{n,1}) + h^2(2u_{n-1,1} - u_{n-1,2}) &\approx -(hk)^2 f_{n-1,1} + h^2 u_{n-1,0} \\
k^2(-u_{n-2,2} + 2u_{n-1,2} - u_{n,2}) + h^2(-u_{n-1,1} + 2u_{n-1,2} - u_{n-1,3}) &\approx -(hk)^2 f_{n-1,2} \\
&\;\;\vdots \\
k^2(-u_{n-2,m-1} + 2u_{n-1,m-1} - u_{n,m-1}) + h^2(-u_{n-1,m-2} + 2u_{n-1,m-1} - u_{n-1,m}) &\approx -(hk)^2 f_{n-1,m-1} \\
k^2(-u_{n-2,m} + 2u_{n-1,m} - u_{n,m}) + h^2(-u_{n-1,m-1} + 2u_{n-1,m}) &\approx -(hk)^2 f_{n-1,m} + h^2 u_{n-1,m+1} \\
k^2(-u_{n-1,1} + 2u_{n,1}) + h^2(2u_{n,1} - u_{n,2}) &\approx -(hk)^2 f_{n,1} + k^2 u_{n+1,1} + h^2 u_{n,0} \\
k^2(-u_{n-1,2} + 2u_{n,2}) + h^2(-u_{n,1} + 2u_{n,2} - u_{n,3}) &\approx -(hk)^2 f_{n,2} + k^2 u_{n+1,2} \\
&\;\;\vdots \\
k^2(-u_{n-1,m-1} + 2u_{n,m-1}) + h^2(-u_{n,m-2} + 2u_{n,m-1} - u_{n,m}) &\approx -(hk)^2 f_{n,m-1} + k^2 u_{n+1,m-1} \\
k^2(-u_{n-1,m} + 2u_{n,m}) + h^2(-u_{n,m-1} + 2u_{n,m}) &\approx -(hk)^2 f_{n,m} + k^2 u_{n+1,m} + h^2 u_{n,m+1}
\end{aligned}
$$

Now the equations are very close to what we want. There are some
unfortunate side effects to such a deliberate approach. The list of equations is
tedious, the subscripts are a bit involved, and it takes some concentration to match
things up. There are some benefits, though, for those who can endure! It will take
very little effort to see how the coefficients are collected.
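The pattern on the right-hand sides is regular enough to capture in one small function: every equation carries the forcing term, and a boundary value is added only when the corresponding neighbor lies on the boundary. The sketch below follows that pattern; the name `rhs_entry` and its parameter list are hypothetical.

```c
#include <assert.h>

/* Right-hand side of the modified equation for interior point (i, j),
 * with 1 <= i <= n and 1 <= j <= m:
 *   -(hk)^2 f_ij
 *   + k^2 * (West or East boundary value, if that neighbor is on the boundary)
 *   + h^2 * (South or North boundary value, if that neighbor is on the boundary)
 * f_ij is the forcing value at the point.  west, east, south and north are
 * boundary values at the neighboring positions; each one is used only when
 * that neighbor actually lies on the boundary and is ignored otherwise. */
double rhs_entry(int i, int j, int n, int m, double h, double k,
                 double f_ij, double west, double east,
                 double south, double north)
{
    double b;

    assert(i >= 1 && i <= n && j >= 1 && j <= m);
    b = -(h * k) * (h * k) * f_ij;
    if (i == 1) b += k * k * west;     /* u_{0,j}   */
    if (i == n) b += k * k * east;     /* u_{n+1,j} */
    if (j == 1) b += h * h * south;    /* u_{i,0}   */
    if (j == m) b += h * h * north;    /* u_{i,m+1} */
    return b;
}
```

For a corner point such as $(1, 1)$ two boundary values contribute, for an edge point one contributes, and for a point such as $(2, 2)$ only the forcing term remains, just as in the list of equations above.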

E.  MATRIX  REPRESENTATION 

It is not hard to translate the preceding equations into the familiar
representation $Ax = b$. Notation is quite important. We will start with the obvious,
exchanging $u$ for $x$ so that (eventually) the system will look like $Au = b$. Dimensions
are important too. The goal is a large, sparse, symmetric matrix
$A \in \mathbb{R}^{mn \times mn}$. The vectors $u$ and $b$ have the obvious dimensions and are
assumed to contain real numbers as well.
1. Unknowns


Since there is a great deal of structure in this problem, it is useful to
partition the vector of unknowns, $u$. Let $u_{i,j}$ have the same meaning as it did
in equation (D.13) and consider the m-vector:

$$u_i = \begin{bmatrix} u_{i,1} \\ u_{i,2} \\ \vdots \\ u_{i,m-1} \\ u_{i,m} \end{bmatrix}$$

This vector captures all of the unknowns for a given column, $i$, of the original region.
Now we can stack the columns, $n$ in number, forming the entire vector $u$ of unknowns:

$$u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{bmatrix}$$

This process has clearly formed $u \in \mathbb{R}^{mn}$. Now we turn to the matrix of coefficients.

2.  Coefficients 

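The matrix $A$ is formed by combining two smaller matrices, $T$ and $D$. First we shall consider the tridiagonal matrix $T \in \mathbb{R}^{m \times m}$. For aesthetic purposes only, let the diagonal elements of $T$ be $d = 2(h^2 + k^2)$.

$$
T = \begin{bmatrix}
d & -h^2 \\
-h^2 & d & -h^2 \\
& -h^2 & d & -h^2 \\
& & \ddots & \ddots & \ddots \\
& & & -h^2 & d & -h^2 \\
& & & & -h^2 & d
\end{bmatrix}
$$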

Next, consider the diagonal matrix $D \in \mathbb{R}^{m \times m}$, whose diagonal entries all equal $-k^2$ (that is, $D = -k^2 I$).
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Forming  the  matrix  A  requires  n  identical  copies  of  T  and  2(n  —  1)  identical 
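copies of $D$. The matrices in $A$ below are assigned subscripts for counting purposes. The matrix subscripts, by the way, denote a value of $i$ corresponding to the partition $u_i$, which the matrix will multiply. $A$ is the block-tridiagonal matrix

$$
A = \begin{bmatrix}
T_1 & D_2 \\
D_1 & T_2 & D_3 \\
& D_2 & T_3 & D_4 \\
& & \ddots & \ddots & \ddots \\
& & & & D_{n-3} & T_{n-2} & D_{n-1} \\
& & & & & D_{n-2} & T_{n-1} & D_n \\
& & & & & & D_{n-1} & T_n
\end{bmatrix}
$$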
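Because of this block structure, A never has to be stored explicitly. The following C sketch applies A to a vector of unknowns using only the coefficients d, -h^2, and -k^2. It is an illustration under stated assumptions, not code from this thesis: it assumes D = -k^2 I (as read from the difference equations), stores u column by column with zero-based indices, and the name apply_A is hypothetical.

```c
#include <assert.h>

/* Computes y = A u for the block-tridiagonal matrix A without forming it.
 * Assumptions (not from the original listings): the unknown u(i,j) is
 * stored at u[i*m + j] for 0 <= i < n, 0 <= j < m, and D = -k^2 I. */
void apply_A(int n, int m, double h, double k, const double *u, double *y)
{
    const double h2 = h * h;
    const double k2 = k * k;
    const double d  = 2.0 * (h2 + k2);   /* diagonal element of T */
    int i, j;

    assert(n > 0 && m > 0);

    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            double s = d * u[i * m + j];
            if (j > 0)     s -= h2 * u[i * m + j - 1];    /* T, sub-diagonal   */
            if (j < m - 1) s -= h2 * u[i * m + j + 1];    /* T, super-diagonal */
            if (i > 0)     s -= k2 * u[(i - 1) * m + j];  /* D, block to left  */
            if (i < n - 1) s -= k2 * u[(i + 1) * m + j];  /* D, block to right */
            y[i * m + j] = s;
        }
    }
}
```

A matrix-free product of this kind is what an iterative solver would call once per iteration; the storage cost is that of the vectors alone.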


3.  Knowns 


We could proceed immediately to the right-hand-side vector, $b \in \mathbb{R}^{nm}$, using the equations provided in the previous section. Again, though, the result can be cleaned up a bit if we form $b$ as the sum of three vectors $f$, $v$, $w$.

The vector $f \in \mathbb{R}^{nm}$ represents the forcing function. The equations clearly indicate where the scalar multiplier comes from.

$$
f = -(hk)^2 \, [\, f_{1,1},\; f_{1,2},\; \ldots,\; f_{1,m},\; f_{2,1},\; f_{2,2},\; \ldots,\; f_{n,m-1},\; f_{n,m} \,]^T
$$

Next, the vector $v \in \mathbb{R}^{nm}$ is used to represent the information that is known due to the boundary values on the East and West sides of the region.

$$
v = k^2 \, [\, u_{0,1},\; \ldots,\; u_{0,m},\; 0,\; \ldots,\; 0,\; u_{n+1,1},\; u_{n+1,2},\; \ldots,\; u_{n+1,m-1},\; u_{n+1,m} \,]^T
$$

Finally, the vector $w \in \mathbb{R}^{nm}$ is used to represent the information that is known due to the boundary values on the North and South sides of the region.

$$
w = h^2 \, [\, u_{1,0},\; 0,\; \ldots,\; 0,\; u_{1,m+1},\; u_{2,0},\; 0,\; \ldots,\; 0,\; u_{2,m+1},\; u_{3,0},\; \ldots,\; u_{n,m+1} \,]^T
$$

Now $b$ is a simple sum of these vectors: $b = f + v + w$.
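The sum b = f + v + w can be assembled in a single pass over the interior grid points. The C sketch below is a minimal illustration, assuming the same column-by-column ordering as the vector u and that the boundary multipliers are k^2 (East and West) and h^2 (North and South), as in the difference equations. The names build_rhs, grid_fn, unit_forcing, and sample_boundary are hypothetical.

```c
#include <assert.h>

/* Hypothetical callback type: returns a value at grid point (i, j). */
typedef double (*grid_fn)(int i, int j);

/* Example forcing function and boundary data, for illustration only. */
double unit_forcing(int i, int j)    { (void) i; (void) j; return 1.0; }
double sample_boundary(int i, int j) { return (double) (i + 10 * j); }

/* Fills b = f + v + w.  Entry (i, j), with 1 <= i <= n and 1 <= j <= m,
 * is stored at b[(i-1)*m + (j-1)]. */
void build_rhs(int n, int m, double h, double k,
               grid_fn f, grid_fn g, double *b)
{
    const double h2 = h * h;
    const double k2 = k * k;
    int i, j;

    assert(n > 0 && m > 0);

    for (i = 1; i <= n; i++) {
        for (j = 1; j <= m; j++) {
            double fv = -(h2 * k2) * f(i, j);        /* forcing term: -(hk)^2 f */
            double v = 0.0, w = 0.0;
            if (i == 1) v += k2 * g(0, j);           /* West boundary column  */
            if (i == n) v += k2 * g(n + 1, j);       /* East boundary column  */
            if (j == 1) w += h2 * g(i, 0);           /* South boundary row    */
            if (j == m) w += h2 * g(i, m + 1);       /* North boundary row    */
            b[(i - 1) * m + (j - 1)] = fv + v + w;
        }
    }
}
```

Which physical side is called East or North depends on how the axes are oriented; only the index pattern matters to the result.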
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F.  CONCLUSION 


This  process  has  shown  a  few  examples  of  partial  differential  equations  that 
appear  frequently  in  nature.  Poisson’s  equation  in  two  dimensions  was  selected  as 
an  example.  After  the  finite  difference  approximation  is  selected,  determining  the 
system  of  equations  is  a  tedious  (but  not  too  complicated)  process.  Once  the  system 
of  equations  is  written  down,  the  matrix  representation  is  easy  to  come  by. 
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APPENDIX  E 

HYPERCUBE  COMMUNICATIONS 


This  report  displays  the  results  of  point-to-point  communications  tests  that 
were  performed  on  the  Intel  iPSC/2  hypercube.  The  emphasis  of  the  experiment 
was  to  evaluate  several  aspects  of  communications  time.  The  exercise  showed  that 
communication  on  this  machine  is  virtually  independent  of  the  Hamming  distance 
between  communicating  nodes.  There  is  clear  evidence  that  transmission  rates  are 
related  to  message  length  (the  transmission  system  favors  longer  messages)  due — at 
least  in  part — to  an  overhead  charged  to  begin  the  communication.  Communications 
between  the  host  and  a  node  never  achieve  the  rate  that  can  be  realized  with  node- 
to-node  transmissions. 

The communications test code described in this appendix was only executed on the iPSC/2. Time did not permit modification of the code and testing on the transputer networks. A thorough test of communications and computational abilities of the T414 and T800 transputers has already been performed by Gregory Bryant. His master's thesis [Ref. 26] contains the documentation of this work. A short summary of Bryant’s findings is included in the conclusions to this appendix.

A.  SOURCE  CODE  OVERVIEW 

The host program (commtst.c) and a node program (commtstn.c) contain most of the code for this experiment. There is also a header file, commtst.h, shared by these codes. Finally (but perhaps most important for any high-level survey of the code), the makefile commtst.mak shows dependencies and compilation procedures. In the discussion that follows, bold-faced type is used to indicate function and object names that actually appear in the code.

B.  STRATEGY 

The  program  must  define  the  valid  arguments.  The  function  interpret_args() 
takes  care  of  checking  for  occurrences  of  these  arguments  in  the  command  line. 
When  the  arguments  have  been  interpreted,  we  know  how  to  set  variables  like  reps 
(repetitions),  bytes  (length  of  the  message  to  be  passed),  and  verbose  (to  control 
how  much  data  is  spewed  out).  Once  these  values  are  known,  the  host  instructs  each 
node  to  either  RECEIVE  or  SEND.  A  special  Tasking  packet  (structure)  carries 
instructions  to  each  node  independently.  Only  one  node  is  designated  to  SEND 
at  any  one  time;  the  rest  RECEIVE.  Receivers  simply  crecv()  the  given  number 
of  bytes  and  return  the  message  to  the  originator  by  calling  csend().  Since  this 
involves  a  round-trip,  the  issue  of  timing  requires  attention. 

We can divide the time measurement by two (to account for the round-trip),
provided we aren’t deceived by the outcome. That is, passing two b-byte messages is
not the same as passing a single message of length 2b bytes. To make the timing data
credible,  however,  the  round-trip  method  is  essential.  The  precision  of  the  mclock() 
function  is  an  additional  issue.  At  best,  mclock()  is  accurate  to  the  millisecond  (and 
ten  milliseconds  may  be  a  more  reasonable  expectation).  Very  short  messages  can 
produce  questionable  results  in  terms  of  the  precision  of  the  timing  data. 

For  this  reason,  tests  of  short  messages  should  be  repeated  a  number  of  times 
within  the  block  surrounded  by  time  checks.  This,  of  course,  revives  the  same  issue 
(multiple  repetitions  of  a  message  are  not  equivalent  to  a  single,  longer  message). 
We  may  proceed,  however,  provided  we  establish  a  common  understanding  of  the 
problem domain and terminology. I have used the term effective time to capture this subtlety. Wherever this term appears, it should be interpreted according to the following definition:

$$
t_e = \frac{t}{2p}
$$

where $t_e$ is the effective time, $t$ is the actual time measurement for the message, and $p$ is the number of repetitions. The factor of two is included to account for the round-trip. For instance, suppose that the user asks for three repetitions of a message. The implementation carries this out in a for loop. Time is sampled before and after the loop. The inside of the loop is the simple csend() and crecv() sequence described earlier. The effective time in this example would be $t_e = t/6$.
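The definition translates directly into code. The C sketch below computes the effective time and the corresponding transmission rate; it is an illustration rather than an excerpt from commtst.c, and the function names are hypothetical. It assumes one kilobyte is 1,024 bytes, which is the convention that reproduces the rates tabulated later in this appendix.

```c
#include <assert.h>

/* Effective one-way time in msec: t_e = t / (2 * reps). */
double effective_time_ms(double t_ms, long reps)
{
    assert(reps > 0);
    return t_ms / (2.0 * (double) reps);
}

/* Transmission rate in kbytes/sec for a message of the given length.
 * Assumes 1 kbyte = 1024 bytes. */
double rate_kbytes_per_sec(long bytes, double te_ms)
{
    assert(te_ms > 0.0);
    return ((double) bytes / 1024.0) / (te_ms / 1000.0);
}
```

For example, a four-byte message measured at t = 68.70 msec over one hundred repetitions gives an effective time of about 0.34 msec and a rate of about 11.37 kbytes/sec.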

In summary, there is no convenient (and credible) method for timing one-way
communications. If we time one-way communications, the results could be misleading
in that we could not be certain that the clock started just before the
beginning of the csend() and stopped immediately after the receiving node accumulated
the final byte of the message. We must also consider the issue of blocking
communication.* Thus, the (round-trip) method is not so easily misled by the fact
that  csend()  is  not  actually  blocking.  The  transmission  duties  are  quickly  handed 
over  to  a  communication  manager  and  processing  continues  directly.  The  crecv() 
enforces  blocking  communications  and  execution  stops  at  this  function  until  the  last 
byte  has  been  acquired.  Thus  the  round-trip  method  seems  to  be  quite  reliable, 
particularly  in  the  case  of  node-to-node  communications  (if  the  host  is  involved,  the 
results  are  less  consistent). 
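The round-trip measurement can be sketched as a small harness. The version below is not the thesis code: to stay runnable without the iPSC/2 library, it takes the send, receive, and clock operations as callbacks (hypothetical types send_fn, recv_fn, clock_fn) that stand in for csend(), crecv(), and mclock(). A simulated link (FakeLink, also hypothetical) is included so the timing arithmetic can be exercised on any machine.

```c
#include <assert.h>

/* Hypothetical callback types standing in for csend(), crecv(), mclock(). */
typedef void (*send_fn)(const char *buf, long len, void *ctx);
typedef void (*recv_fn)(char *buf, long len, void *ctx);
typedef unsigned long (*clock_fn)(void *ctx);

/* Times `reps` round trips and returns the effective one-way time (msec). */
double time_round_trips(send_fn send, recv_fn recv, clock_fn clock_ms,
                        char *buf, long len, long reps, void *ctx)
{
    unsigned long start, stop;
    long i;

    assert(reps > 0);
    start = clock_ms(ctx);              /* sample the clock before the loop */
    for (i = 0; i < reps; i++) {
        send(buf, len, ctx);            /* may return before delivery       */
        recv(buf, len, ctx);            /* blocks until the echo is back    */
    }
    stop = clock_ms(ctx);               /* sample the clock after the loop  */
    return (double) (stop - start) / (2.0 * (double) reps);
}

/* Simulated link for testing: each echo costs two hops of hop_ms each. */
typedef struct {
    unsigned long now_ms;
    unsigned long hop_ms;
    long sent;
    long received;
} FakeLink;

void fake_send(const char *buf, long len, void *ctx)
{
    FakeLink *l = (FakeLink *) ctx;
    (void) buf; (void) len;
    l->sent++;
}

void fake_recv(char *buf, long len, void *ctx)
{
    FakeLink *l = (FakeLink *) ctx;
    (void) buf; (void) len;
    l->now_ms += 2 * l->hop_ms;         /* out and back */
    l->received++;
}

unsigned long fake_clock(void *ctx)
{
    return ((FakeLink *) ctx)->now_ms;
}
```

Because the clock is read on the sending node only, the harness does not depend on clocks being synchronized between nodes.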

*By definition, blocking means that the invoking process (send or receive) causes execution of the program to stop (be blocked from the CPU) until the communications requirement has been satisfied.

Since receiver nodes have nothing else to do but receive and retransmit the message, the performance loss due to the round-trip method should be (almost entirely) accounted for by two factors (loosely) placed into “software” and “hardware” categories:

•  Software overheads like establishing and freeing the activation stack for functions (e.g., the csend() and crecv() functions).

•  Hardware overheads associated with establishing the communication path and performing switching. The take-down time for this task is probably negligible.

Hence,  if  this  method  of  analyzing  communications  performance  errs,  it  does  so  on 
the  conservative  side.  That  is,  the  timing  used  in  this  method  is  liberal  (if  anything), 
so  that  communication  rates  will  be  estimated  conservatively. 

C.  RESULTS 

Considering  the  nature  of  the  implementation,  communications  will  be  consid¬ 
ered  bidirectional.  In  particular,  the  term  “host-to-node”  communications  does  not 
imply  that  the  host  is  the  originator  of  directed  communication,  but  that  a  bidirec¬ 
tional  exchange  takes  place  between  some  node  and  the  host.  The  host  does  send 
directed,  one-way  instructions  to  the  nodes,  but  all  timed  communication  originates 
at a node and returns to that node (even if it goes to the host). There are essentially
three groups of results, each of which captures data for node-to-node communications
and host-to-node communications.

1.  Small  Messages  Repeated  Ten  Times 

The first test involved messages of length $\ell \le 1{,}024$ bytes. Since the shortest of these would not generate trustworthy timing data, the repetition count, $p$, was set at ten. This gave $t_e = t/20$. Table E.1 shows the results.

TABLE E.1: SHORT MESSAGES WITH TEN REPETITIONS

Figure E.1: Speed of Small Host-Node Messages (Ten Repetitions)


a.  Host-to-Node  Performance 

The communication rates for small host-node messages with a repetition
count of ten are illustrated in Figure E.1. Communications involving the host
produce  very  irregular  results  (in  the  sense  that  the  relationship  between  length  and 
performance  is  not  straightforward).  The  experiment  was  executed  when  only  one 
user  was  logged  in  at  the  host  and  the  results  followed  the  same  general  pattern  on 
repeated  tests. 
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Figure  E.2:  Speed  of  Small  Messages  Between  Nodes  (Ten  Repetitions) 

b.  Node-to-Node  Performance 

In  the  absence  of  contention  for  the  communication  medium,  node- 
to-node  communications  within  the  cube  are  quite  predictable.  Figure  E.2  shows 
transmission rates for small messages (up to one kilobyte) repeated ten times.

TABLE E.2: SHORT MESSAGES WITH ONE HUNDRED REPETITIONS

Message          Node-to-Node                      Host-to-Node
Length    t        t_e      Rate            t         t_e      Rate
(Bytes)   (msec)   (msec)   (kbytes/sec)    (msec)    (msec)   (kbytes/sec)

1         --       0.34     2.85            --        4.19     0.23
2         --       0.34     5.69            --        4.09     0.48
4         68.70    0.34     11.37           --        3.98     0.98
8         69.40    0.35     22.51           774.50    3.87     2.02
16        70.30    0.35     44.45           758.30    3.79     4.12
32        71.70    0.36     87.17           737.10    3.69     8.48
64        75.30    0.38     166.00          721.30    3.61     17.33
128       137.60   0.69     181.69          1020.10   5.10     24.51
192       142.30   0.71     263.53          1007.10   5.04     37.24
256       146.80   0.73     340.60          1007.00   5.04     49.65
320       152.00   0.76     411.18          1004.50   5.02     62.22
384       156.20   0.78     480.15          1013.40   5.07     74.01
448       161.00   0.81     543.48          1043.80   5.22     83.83
512       165.30   0.83     604.96          1152.90   5.76     86.74
576       169.80   0.85     662.54          1335.40   6.68     84.24
640       174.50   0.87     716.33          1419.50   7.10     88.06
704       179.30   0.90     766.87          1688.50   8.44     81.43
768       --       0.92     818.78          1869.90   9.35     80.22
832       --       0.94     863.44          1520.00   7.60     106.91
896       --       0.96     907.21          1070.30   5.35     163.51
960       --       0.99     948.41          1061.60   5.31     176.62
1024      --       1.01     988.14          1048.80   5.24     190.69

2.  Small  Messages  Repeated  One  Hundred  Times 

For the next experiment, data was collected from runs using the same message lengths, but the repetition count, $p$, was raised to one hundred. This gives $t_e = t/200$, as shown in Table E.2.

a.  Host-to-Node Performance

Figure E.3 gives the transmission rates corresponding to this data.

Figure E.4: Speed of Small Messages Between Nodes (One Hundred Repetitions) [axis: Transmission Rate (kilobytes/second)]


b.  Node-to-Node  Performance 

Figure  E.4  shows  the  transmission  rates  for  the  node-to-node  messages. 
This  data  may  have  important  implications.  Consider  the  transmission  of  a  matrix 
row-by-row  within  a  loop  (where  one  row  is  transmitted  each  time  through  the 
loop).  The  expected  communications  performance  is  related  to  the  number  of  bytes 
in  a  single  row  of  the  matrix,  not  the  size  of  the  entire  matrix. 
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3.  Larger  Messages 

The final test considered longer messages ($1{,}024 \le \ell \le 262{,}144$) that were not repeated. This gives $t_e = t/2$. Since the experiment was performed over a rather large set of message lengths, the data is divided at an arbitrary point. Messages of 64K bytes and less are designated “medium” length messages and placed into Table E.3. Messages of length 128K bytes and greater are designated “long” messages and placed into Table E.4. There is no hidden significance to this separation; it just made for tables of reasonable length.

The figures that follow are based upon the combined data of both of these tables. The host terminates execution at the crecv() if we ask for more than 262,144 bytes in a single message. Chapter 2 (“iPSC/2 C Library Calls”) of [Ref. 45: pp. 2-16, 2-19] explains: “messages to or from a host process are limited to a maximum of 256K bytes. There is no limit on message length between nodes.” This explains
why the data stops at that message size.

TABLE E.3: MESSAGES OF MEDIUM LENGTH

Message          Node-to-Node                      Host-to-Node
Length    t        t_e      Rate            t         t_e      Rate
(Bytes)   (msec)   (msec)   (kbytes/sec)    (msec)    (msec)   (kbytes/sec)

1024      --       1.10     909.09          9.00      4.50     222.22
2048      --       1.40     1428.57         10.40     5.20     384.62
3072      3.70     1.85     1621.62         11.90     5.95     504.20
4096      4.40     2.20     1818.18         13.40     6.70     597.01
5120      5.10     2.55     1960.78         14.50     7.25     689.66
6144      5.80     2.90     2068.97         14.50     7.25     827.59
7168      6.50     3.25     2153.85         15.50     7.75     903.23
8192      7.40     3.70     2162.16         16.50     8.25     969.70
9216      8.10     4.05     2222.22         19.50     9.75     923.08
10240     8.80     4.40     2272.73         18.00     9.00     1111.11
11264     9.50     4.75     2315.79         18.90     9.45     1164.02
12288     10.30    5.15     2330.10         19.00     9.50     1263.16
13312     10.90    5.45     2385.32         19.60     9.80     1326.53
14336     11.80    5.90     2372.88         20.30     10.15    1379.31
15360     12.50    6.25     2400.00         21.90     10.95    1369.86
16384     13.20    6.60     2424.24         22.40     11.20    1428.57
17408     13.90    6.95     2446.04         23.30     11.65    1459.23
18432     14.60    7.30     2465.75         24.90     12.45    1445.78
19456     15.40    7.70     2467.53         24.30     12.15    1563.79
20480     16.10    8.05     2484.47         27.30     13.65    1465.20
21504     16.80    8.40     2500.00         27.10     13.55    1549.82
22528     17.60    8.80     2500.00         27.00     13.50    1629.63
23552     18.40    9.20     2500.00         27.80     13.90    1654.68
24576     19.10    9.55     2513.09         29.30     14.65    1638.23
25600     19.80    9.90     2525.25         29.40     14.70    1700.68
26624     20.50    10.25    2536.59         30.60     15.30    1699.35
27648     21.30    10.65    2535.21         30.90     15.45    1747.57
28672     22.10    11.05    2533.94         33.50     16.75    1671.64
29696     22.70    11.35    2555.07         38.50     19.25    1506.49
30720     23.50    11.75    2553.19         37.90     18.95    1583.11
31744     24.20    12.10    2561.98         37.90     18.95    1635.88
32768     24.90    12.45    2570.28         38.10     19.05    1679.79
65536     48.50    24.25    2639.18         59.90     29.95    2136.89

TABLE E.4: LONG MESSAGES

Message          Node-to-Node                      Host-to-Node
Length    t        t_e      Rate            t         t_e      Rate
(Bytes)   (msec)   (msec)   (kbytes/sec)    (msec)    (msec)   (kbytes/sec)

131072    --       47.80    2677.82         109.40    54.70    2340.04
150528    --       54.80    2682.48         --        61.80    2378.64
161792    117.70   58.85    2684.79         131.60    65.80    2401.22
162816    118.40   59.20    2685.81         132.90    66.45    2392.78
163840    119.10   59.55    2686.82         133.60    66.80    2395.21
164864    119.90   59.95    2685.57         --        67.50    2385.19
165888    120.60   60.30    2686.57         --        68.15    2377.11
172032    125.00   62.50    2688.00         --        70.40    2386.36
182272    132.40   66.20    2688.82         148.10    74.05    2403.78
192512    139.70   69.85    2691.48         155.60    77.80    2416.45
202752    147.10   73.55    2692.05         164.60    82.30    2405.83
223232    161.80   80.90    2694.68         181.10    90.55    2407.51
243712    176.50   88.25    2696.88         194.80    97.40    2443.53
253952    183.80   91.90    2698.59         202.80    101.40   2445.76
259072    187.60   93.80    2697.23         205.50    102.75   2462.29
262144    189.70   94.85    2699.00         210.50    105.25   2432.30

Figure E.5: Speed of Large Host-Node Messages [axis: Transmission Rate (kilobytes/second)]

a.  Host-to-Node  Performance 

The host-to-node communication rates (for large messages) are illustrated in Figure E.5.

Figure E.6: Speed of Large Messages Between Nodes [axis: Transmission Rate (kilobytes/second)]


b.  Node-to-Node  Performance 

Figure E.6 shows the transmission rates for the same long messages when passed among nodes of the hypercube. To move the plot of Figure E.6 out into the open, a plot of transmission rate versus $\log_{10} \ell$ is shown in Figure E.7.

Figure E.7: Node-to-Node Transmission Rates for Large Messages [axis: Transmission Rate (megabytes/second)]
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D.  CONCLUSIONS 


One  of  the  obstacles  that  this  experiment  carefully  avoided  was  competition 
for  the  links.  Contention  for  communications  resources  may  be  inherent  in  certain 
parallel  programs.  Potential  causes  and  effects  of  contention  should  always  be  given 
due consideration in the crafting of a parallel application. All of the algorithms that
were tested in this research work involved very structured, regular communications
schemes. An application with very random communication patterns should be expected
to behave very differently. Additionally, the communication scheme for every
program in this work was designed to use the shortest possible path.

The circuit switching approach has the disadvantage that a single message must control the entire path from origin to destination. Under a less controlled, random pattern of communications, the communications subsystem might reasonably be expected to perform worse. Other portions of this thesis show that a communication-bound algorithm can experience severe performance degradation as well. There is no specific claim that the results obtained in this experiment represent an upper bound for node-to-node communications within the hypercube, but they are probably good estimates for an upper bound.

Host-node  communication  is  slower  than  node-to-node  communication.  This 
is  not  surprising  (consider  the  physical  distances  and  materials).  In  the  absence  of 
competition  for  the  links,  node-to-node  transmission  rates  are  essentially  predictable 
for  a  given  message  length.  There  is  a  tremendous  rise  in  transmission  rate  as  message 
length  goes  from  one  byte  to  the  vicinity  of  twenty  kilobytes.  Thereafter,  smaller 
(apparently  asymptotic)  performance  gains  are  achieved  by  increasing  the  message 
size. A similar phenomenon occurs with host-node communications, but it takes
much longer messages to break, say, the two-megabytes-per-second transmission
rate.
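One simple way to picture this behavior is a start-up cost plus a constant per-byte cost, so that the effective time is t_e = t_0 + length/B. This linear model is an assumption introduced here for illustration and is not an analysis performed in this work. The C sketch below fits the two parameters from two measurements and predicts the rate at any length; the names LinkModel, fit_two_points, and predicted_rate_kbytes_per_sec are hypothetical.

```c
#include <assert.h>

/* Assumed linear cost model: t_e = startup_ms + bytes / bytes_per_ms. */
typedef struct {
    double startup_ms;     /* fixed cost charged to begin a transfer */
    double bytes_per_ms;   /* asymptotic transfer speed              */
} LinkModel;

/* Fits the model through two (length, effective time) measurements. */
LinkModel fit_two_points(long len1, double te1_ms, long len2, double te2_ms)
{
    LinkModel m;

    assert(len1 != len2 && te1_ms != te2_ms);
    m.bytes_per_ms = (double) (len2 - len1) / (te2_ms - te1_ms);
    m.startup_ms   = te1_ms - (double) len1 / m.bytes_per_ms;
    return m;
}

/* Predicted rate in kbytes/sec (1 kbyte = 1024 bytes) for one message. */
double predicted_rate_kbytes_per_sec(LinkModel m, long len)
{
    double te_ms = m.startup_ms + (double) len / m.bytes_per_ms;

    assert(len > 0 && te_ms > 0.0);
    return ((double) len / 1024.0) / (te_ms / 1000.0);
}
```

Fitting the node-to-node entries for 4,096 bytes (2.20 msec) and 262,144 bytes (94.85 msec) from Tables E.3 and E.4 gives a start-up cost of roughly 0.7 msec under this model. Messages of 64 bytes and less are measured well below that figure in Table E.2, so the model should not be extrapolated to very short messages.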
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These  performance  measures  are  quite  appealing  for  long  messages,  but  con¬ 
sider  transmissions  of  shorter  (and  possibly  repetitious)  messages.  The  data  shows 
that  short  messages  are  penalized,  even  if  they  are  part  of  a  loop  that  involves  a 
good  deal  of  communication.  Each  instance  of  csend()  or  crecv()  is  distinct  and 
incurs  its  own  start-up  cost.  This  is  an  important  note  for  anyone  considering 
transmission  of  the  rows  (or  columns)  of  a  matrix  within  a  loop  structure.  The 
potential  of  (pre-transmission)  storage  of  matrices  (two-dimensional  arrays)  into 
one-dimensional  arrays  might  be  investigated  as  a  means  of  increasing  the  commu¬ 
nications  rate  (provided  the  cost  of  copying  the  array  is  not  prohibitive). 
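The packing idea can be sketched in a few lines of C. The example assumes the matrix is held as an array of row pointers (so the rows are not contiguous in memory); the function name pack_rows is hypothetical and the sketch is not taken from the programs in this thesis. The packed buffer could then be handed to a single send call instead of one call per row.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies a rows-by-cols matrix, stored as an array of row pointers, into
 * one contiguous buffer in row-major order.  Returns NULL if allocation
 * fails.  The caller owns the buffer and must free() it. */
double *pack_rows(double **a, int rows, int cols)
{
    double *buf;
    int r;

    assert(a != NULL && rows > 0 && cols > 0);

    buf = (double *) malloc((size_t) rows * (size_t) cols * sizeof(double));
    if (buf == NULL)
        return NULL;

    for (r = 0; r < rows; r++)
        memcpy(buf + (size_t) r * (size_t) cols, a[r],
               (size_t) cols * sizeof(double));

    return buf;
}
```

A matrix declared as a true two-dimensional C array is already contiguous and needs no copy; the packing step pays off only when rows are allocated separately.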

Communications  in  a  transputer  network  was  not  developed  in  this  work,  but 
Bryant  [Ref.  26]  gives  a  very  thorough  analysis  of  communications  and  calculations 
in  a  network  of  transputers.  On  pages  31-34,  Bryant  gives  a  good  summary  of 
unidirectional  and  bidirectional  data  transfer  rates.  He  discusses  link  interaction  (i.e., 
how  communications  performance  varies  as  one,  two,  or  all  four  of  the  transputer’s 
links  are  engaged  in  communication)  on  pages  34-38  and  concludes  that  the  effects 
of  link  interaction  are  minimal. 

Bryant  also  discusses  the  effects  of  varied  communication  loads  on  processor 
performance.  On  pages  38-44,  he  finds  that  bombarding  a  transputer  with  many 
small  messages  while  it  is  trying  to  perform  calculations  can  severely  degrade  the 
processor’s  performance.  His  Figures  3.8  and  3.9  show  that — with  only  one  link 
active — messages  of  size  100  bytes  and  larger  cause  negligible  performance  degrada¬ 
tion.  With  all  four  links  active,  messages  of  size  greater  than  one  kilobyte  should  be 
used  to  free  the  processor  from  most  of  the  communications  overhead. 

Pages 36 and 37 of Bryant’s thesis show the effects of message length on the communication rate. Bryant’s Figures 3.4 and 3.5 are quite similar to Figure E.6 above, but the transputers are much more responsive (i.e., there seems to be less overhead involved, so the peak communications rate is achieved much earlier). In fact, the transputers are near their peak transmission rate with messages of 100 bytes, and messages of one kilobyte and greater always travel at peak rates.

Comparing a transputer system to an iPSC/2 system, in terms of communications
performance, is essentially a lesson in the differences between store-and-forward
switching and circuit switching for multi-hop communications. Bryant
shows [Ref. 26: pp. 83-85] that the store-and-forward transmission rates suffer as
the  number  of  hops  grows.  The  direct-connect  (circuit  switching)  approach  recovers 
its  overhead  on  multi-hop  communications,  but  it  ties  up  the  entire  path  to  do  so 
(making  it  unavailable  to  other  potential  users).  The  key  difference  is  that  commu¬ 
nications  performance  with  the  direct-connect  method  is  very  nearly  independent  of 
the  number  of  hops. 

The  transputer  system  seems  to  enforce  true  blocking  communications  on  both 
the  sending  and  receiving  ends  (byte-by-byte  acknowledgment  is  part  of  the  pro¬ 
tocol).  The  iPSC/2  csend()  is  not  blocking,  but  the  crecv()  function  is  blocking. 
Proper handling of these issues can become important when implementing an algorithm. Each method has advantages and disadvantages, but (at least for the current systems) transputers seem better suited for applications involving short messages over short distances and the iPSC/2 seems to handle long messages over long distances better.

E.  SOURCE  CODE  LISTINGS 

The  source  code  listings  for  the  programs  used  for  these  tests  are  supplied  on 
the  pages  that  follow.  The  makefile  commtst.mak  appears  first  and  describes  the 
dependencies  among  the  files  and  compilation  procedures.  Next,  commtst.h  is  the 
header  file  associated  with  these  programs.  Finally,  the  actual  code  is  given  in  a  host 
program called commtst.c and the node program commtstn.c.

commtst.mak

# Author:  Jonathan E. Hartman, U. S. Naval Postgraduate School
# Purpose: Makefile for Hypercube Communications Test Programs
# Date:    07 August 1991

all: hostcode nodecode

help:
	chelp


# ----------------------------------------------------------------------
hostcode: commtst.o clargs.o
	cc clargs.o commtst.o -host -o commtst

clargs.o: clargs.h clargs.c
commtst.o: commtst.h commtst.c


# ----------------------------------------------------------------------
nodecode: commtstn.o
	cc commtstn.o -node -o commtstn

commtstn.o: commtstn.c commtst.h


# Execute it! ----------------------------------------------------------
run: all
	commtst -d 3 -b 1024 -r 2


# Delete object files, executables -------------------------------------
clean:
	rm *.o
	rm commtst
	rm commtstn


# EOF commtst.mak ------------------------------------------------------

commtst.h

/* ---- ============ PROGRAM INFORMATION ============ ----
 *
 *  SOURCE   commtst.h
 *  VERSION  1.2
 *  DATE     07 August 1991
 *  AUTHOR   Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ---- ============ DESCRIPTION ============ ----
 *
 *  This header file gives common information for use across the host program
 *  commtst.c and the node program commtstn.c.  A more complete description
 *  can be found in commtst.c.
 *
 */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE  -1
#endif

#define MAX_CUBESIZE  16

#define ROOT          -1

#define RECEIVE       0
#define SEND          1

#define FALSE         0
#define TRUE          1

/* ---- ============ TYPE DEFINITION ============ ----
 *
 *  The following structure is the framework that the root processor (host)
 *  uses to pass instructions to the worker nodes in the cube.
 */

typedef struct {

   int   task;                       /* choose RECEIVE or SEND as above    */
   long  bytes;                      /* length of message                  */
   long  reps;                       /* number of repetitions              */

   int   destination[MAX_CUBESIZE];  /* for senders: identifies addressees */
} Tasking;


49  /* - ============  EOF  commtst.h  ============ - ♦/ 
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7 

8 
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11 

12 

13 

14 
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PROGRAM  IIFORNATIOI 


SOURCE 

conmtst.c 

VERSIOM 

1.2 

DATE 

07  August  1991 

AUTHOR 

Jonathan  E.  Hartman,  U.  S.  laval  Postgraduate  School 

USAGE 

conmtst  [-d  dimension]  [-b  bytes]  [-r  repetitions]  C-v] 

EXAMPLE 

If  you  type  'commtst  -d  3  -v  -b  1024  -r  10’,  it  means  to 
run  the  program  on  a  dimension  3  hypercube  in  the  verbose 
mode,  sith  messages  of  length  1024  bytes,  and  10  repeti¬ 
tions  for  each  nessage. 

REFERENCES 

[1]  iPSC/2  Programmer's  Reference  Manual 

DESCRIPTIOH 


20 

21 

22 

23 

24 

25 

26 
27 


This  program  runs  on  the  host .  It  orchestrates  various  point-to-point 
communication  tasks  betveen  nodes  of  a  hypercube.  The  time  of  round-trip 
communications  is  gathered  and  printed  out.  The  output  includes  the  tine 
required  and  rate  of  communication  (taking  into  account  repetitions  and 
round-trips) .  The  'verbose’  node  gives  a  more  detailed  node-by-node 
accounting  oi  the  run. 


28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 


char  eversion  =  "Hypercube  Communications  Test,  Version  1.2”; 


ALGORITHM 


The  root  (host)  processor  determines  mho  will  connunicate  vith  shorn,  and 
shen.  Ho  node  operates  independently.  The  host  identifies  a  sender  and 
receiver(s).  The  host  also  gives  the  length  of  the  message  that  should 
be  passed  and  the  number  of  tines  that  the  message  is  to  be  repeated 
(multiple  repetitions  nay  be  required  shen  the  message  is  short  since 
nclockO  returns  milliseconds).  The  'Tasking'  structure  holds  instruc- 
from  the  manager  (i.e.,  SEID  or  RECEIVE,  the  length  of  the  nessage,  num¬ 
ber  of  repetitions,  and  addressees).  When  this  structure  is  received  at 
a  node,  it  performs  the  task  and  asaits  further  instructions  from  the 
manager  processor.  If  the  processor  is  a  sender,  it  returns  timing  data 
to  the  host  upon  completion. 
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commtst.c 


#include <stdio.h>
#include <stdlib.h>     /* calloc(), exit(), EXIT_FAILURE */
#include "commtst.h"
#include "ipsc.h"
#include "macros.h"
#include "clargs.h"

#define ASCII_CONVERSION  48    /* for char -> int conversion of 0...3 */
#define CT_SIZE            4    /* for cubetype[] size                 */

#define NUM_ARGS           4    /* -d -b -r -v                         */
#define DIM                0    /* index values into optv[]            */
#define BYTES              1
#define REPS               2
#define VERBOSE            3


#ifdef PROTOTYPE

void init(int argc, char **argv, char cubetype[CT_SIZE],
          int *dim, long *bytes, long *reps, int *verbose)

#else

void init(argc, argv, cubetype, dim, bytes, reps, verbose)

    int   argc;
    char  **argv,
          cubetype[CT_SIZE];
    int   *dim;
    long  *bytes,
          *reps;
    int   *verbose;

#endif
{
    int count = 1,
        valid = FALSE;

    Opt_Struct *optv[NUM_ARGS];

    /* The first step is to make a table of all of the valid arguments.  The
     * structure is defined more carefully in clargs.h, but the basic idea is
     * that we have an array of pointers to type Opt_Struct (option structure)
     * ...in this case, there are NUM_ARGS valid arguments and the next few
     * steps take care of allocation and definition of them.  When this is
     * done, it is time to call interpret_args() to see what the user entered.
     */

    optv[DIM]     = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
    optv[BYTES]   = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
    optv[REPS]    = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
    optv[VERBOSE] = (Opt_Struct *) calloc( 1, sizeof(Opt_Struct) );
    optv[DIM]->lanswer   = (long *) calloc( 1, sizeof(long) );
    optv[BYTES]->lanswer = (long *) calloc( 1, sizeof(long) );
    optv[REPS]->lanswer  = (long *) calloc( 1, sizeof(long) );

    /* The intel compiler didn't like ...->argname = "-d"; etc. */
    optv[DIM]->argname[0] = '-';
    optv[DIM]->argname[1] = 'd';
    optv[DIM]->subargc = 1;
    optv[DIM]->subargi = NEXT_LONG;

    optv[BYTES]->argname[0] = '-';
    optv[BYTES]->argname[1] = 'b';
    optv[BYTES]->subargc = 1;
    optv[BYTES]->subargi = NEXT_LONG;

    optv[REPS]->argname[0] = '-';
    optv[REPS]->argname[1] = 'r';
    optv[REPS]->subargc = 1;
    optv[REPS]->subargi = NEXT_LONG;

    optv[VERBOSE]->argname[0] = '-';
    optv[VERBOSE]->argname[1] = 'v';
    optv[VERBOSE]->subargc = 0;

    *dim = -1;
    *bytes = 0;     /* forces the prompt below unless -b supplies a value */

    interpret_args(argc, argv, NUM_ARGS, optv);

    if (optv[DIM]->found) *dim = (int) optv[DIM]->lanswer[0];

    switch (*dim) {

        case 0 : case 1 : case 2 : case 3 : break;

        default:
            while (!valid) {

                printf("Enter desired cube dimension (in {0, 1, 2, 3}): ");
                scanf("%d", dim);
                fflush(stdin);
                switch (*dim) {
                    case 0 : case 1 : case 2 : case 3 : valid = TRUE; break;
                }
            }
    } /* end switch() */

    if (optv[BYTES]->found) *bytes = optv[BYTES]->lanswer[0];

    valid = FALSE;

    if (*bytes < 1) {
        while (!valid) {
            printf("Enter message length (bytes): ");
            scanf("%ld", bytes);
            fflush(stdin);
            if (*bytes > 0) { valid = TRUE; }
            else { printf("Message length must be positive.\n"); }
        }
    }

    if (optv[REPS]->found) { *reps = optv[REPS]->lanswer[0]; }
    else {

        printf("Non-existing (or invalid) repetition count, ");
        printf("using one repetition.\n\n");
        *reps = 1;
    }

    /* a conditional expression is not an lvalue, so assign its result */
    *verbose = (optv[VERBOSE]->found) ? TRUE : FALSE;

    cubetype[0] = 'd';      /* for dimension (to follow) */
    cubetype[1] = (char)(*dim + ASCII_CONVERSION);
    cubetype[2] = 'f';      /* means nodes are 386/387 combo */
    cubetype[3] = 0;

    printf("Initialization complete...Cube Dimension: %d\n", *dim);
    printf("                          Message Length: %ld\n", *bytes);
    printf("                          Repetitions:    %ld\n\n", *reps);
    if (*verbose) printf("                          Verbose Mode:   ON");
}
/* End init() */


#ifdef PROTOTYPE

main(int argc, char *argv[])

#else

main(argc, argv)

    int  argc;
    char *argv[];

#endif
{   /* begin main() */

    char *cubename = "Hypercube",
         cubetype[CT_SIZE],
         *msg,
         *nodecode = "commtstn";

    float avg,
          avg_hostrate,
          avg_hosttime,
          avg_rate,
          avg_time,
          bytes,
          reps;

    int cubesize,
        dim,
        i,
        j,
        verbose;

    unsigned long **timing_data;

    Tasking task_packet;


    printf("\n%s\n\n", version);

    init(argc, argv, cubetype, &dim, &(task_packet.bytes),
         &(task_packet.reps), &verbose);

    bytes = (float) task_packet.bytes;
    reps  = (float) task_packet.reps;
    bytes *= (2.0 * reps);  /* account for two-way communications, reps */

    cubesize = POW2(dim);

    timing_data = (unsigned long **) calloc(cubesize, sizeof(unsigned long*));

    for (i = 0; i < cubesize; i++) {

        timing_data[i] = (unsigned long*) calloc(cubesize, sizeof(unsigned long));
    }

    if (!(msg = (char *) calloc(task_packet.bytes, sizeof(char)))) {

        printf("main(): Allocation failure for msg.\n");
        exit(EXIT_FAILURE);
    }


    /* Get the cube and load the node code */

    getcube(cubename, cubetype, NULL, 0);
    attachcube(cubename);
    setpid(0);

    load(nodecode, ALL_NODES, NODE_PID);


    /* Perform the tasking, receive the message, return it, receive and print
     * timing data...repeat for all players.  The outer loop index, i, will
     * represent the sender node.  The j index runs the other (RECEIVE)
     * players.
     */

    for (i = 0; i < cubesize; i++) {

        /* Get the receivers ready first */
        task_packet.task = RECEIVE;
        task_packet.destination[0] = i;
        task_packet.destination[1] = cubesize;  /* impossible flags end */
        for (j = 0; j < i; j++) {
            csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
        }
        for (j = (i + 1); j < cubesize; j++) {
            csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
        }

        /* Then prepare the sender ==> he can start */
        task_packet.task = SEND;
        for (j = 0; j < i; j++) task_packet.destination[j] = j;
        task_packet.destination[i] = ROOT;
        for (j = (i + 1); j < cubesize; j++) task_packet.destination[j] = j;
        csend(0, &task_packet, sizeof(Tasking), i, NODE_PID);

        /* Receive from the sender and return his message */
        for (j = 0; j < task_packet.reps; j++) {
            crecv(ANY_TYPE, msg, task_packet.bytes);
            csend(0, msg, task_packet.bytes, i, NODE_PID);
        }

        /* Receive the timing data from this run and print it */
        crecv(ANY_TYPE, timing_data[i], (cubesize * sizeof(unsigned long)) );

    } /* end for (i) */

    for (i = 0; i < cubesize; i++) {
        if (verbose) {
            printf("Source  Dest.  Time (msec)  Rate (kilobytes/second)\n");
            printf("======  =====  ===========  =======================\n");
            printf("%4d    HOST   %10lu ", i, timing_data[i][i]);
            printf("  %10.2f\n", (bytes / ((float) timing_data[i][i])) );
        }

        avg = 0.0;
        for (j = 0; j < cubesize; j++) {
            if (i != j) {
                avg += (float) timing_data[i][j];
                if (verbose) {
                    printf("        %4d", j);
                    printf("   %10lu ", timing_data[i][j]);
                    printf("  %10.2f\n", (bytes / ((float) timing_data[i][j])) );
                }
            }

            if (j == (cubesize - 1)) {
                avg /= (float) cubesize - 1;
                if (verbose) {
                    printf("============================================");
                    printf("==========\n");
                    printf("Averages . %9.1f msec ", avg);
                    printf("  %7.2f", bytes/avg );
                    printf(" kbytes/sec\n\n\n");
                }
            }
        } /* end for(j) */
    } /* end for(i) */

    /* The accumulators must start at zero before the sums are formed. */
    avg_hosttime = 0.0;
    avg_time     = 0.0;

    for (i = 0; i < cubesize; i++) {
        for (j = 0; j < cubesize; j++) {
            if (i == j) avg_hosttime += timing_data[i][j];
            else        avg_time     += timing_data[i][j];
        }
    }

    avg_hosttime /= cubesize;
    avg_hostrate = bytes/avg_hosttime;

    avg_time /= ((cubesize - 1) * cubesize);
    avg_rate = bytes/avg_time;

    printf("If we average all of the times and rates....\n\n");
    printf("    Average Time: %9.1f milliseconds\n", avg_time);
    printf("    Average Rate: %10.2f kilobytes/second\n\n\n", avg_rate);

    printf("NOTE: Average and Rate values are for the nodes ONLY.\n");
    printf("      They do not include the host timing data.\n\n\n");

    printf("The averages for the node <--> host communications were:\n\n");
    printf("    Average Time: %9.1f milliseconds\n", avg_hosttime);
    printf("    Average Rate: %10.2f kilobytes/second\n\n\n", avg_hostrate);

}
/* ============ EOF commtst.c ============ */
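
The summary arithmetic at the end of commtst.c can be isolated from the iPSC/2 calls. The sketch below is not part of the thesis code: the names round_trip_rate, TimingSummary, and summarize are hypothetical, and the timing matrix is assumed to be stored row-major in a flat array rather than as an array of row pointers. It treats one byte per millisecond as one kilobyte (1000 bytes) per second, which appears to be the convention behind the program's printed rates.

```c
/* Rate in kilobytes/second for `reps` round trips of a `bytes`-byte
 * message that took `msec` milliseconds in total.  Each round trip
 * moves the message twice, hence the factor of two. */
double round_trip_rate(long bytes, long reps, unsigned long msec)
{
    if (msec == 0) return 0.0;      /* clock too coarse: no estimate */
    return (2.0 * (double) reps * (double) bytes) / (double) msec;
}

/* Averages over an n x n timing matrix stored row-major in t[].
 * Entry t[i*n + i] is node i's exchange with the host; every other
 * entry t[i*n + j] is node i's exchange with node j. */
typedef struct {
    double host_avg;    /* mean of the n diagonal entries           */
    double node_avg;    /* mean of the n*(n-1) off-diagonal entries */
} TimingSummary;

TimingSummary summarize(const unsigned long *t, int n)
{
    TimingSummary s = { 0.0, 0.0 }; /* accumulators start at zero */
    int i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (i == j) s.host_avg += (double) t[i * n + j];
            else        s.node_avg += (double) t[i * n + j];
        }
    }
    if (n > 0) s.host_avg /= n;
    if (n > 1) s.node_avg /= (double) n * (n - 1);
    return s;
}
```

Guarding the divisions matters for a dimension-0 cube (a single node), where there are no node-to-node exchanges to average.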




commtstn.c 


PROGRAM INFORMATION

    SOURCE   :  commtstn.c
    VERSION  :  1.2
    DATE     :  07 August 1991
    AUTHOR   :  Jonathan E. Hartman,
                U. S. Naval Postgraduate School


DESCRIPTION

This program is loaded by commtst.c (which runs on the host).  This code
(commtstn.c) runs on the nodes of a hypercube created by the host program.
For more information, see commtst.c.


#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"

#define SUCCESS 0


#ifdef PROTOTYPE

main(int argc, char *argv[])

#else

main(argc, argv)

    int  argc;
    char *argv[];
#endif
{
    char *msg;

    int cubesize = numnodes(),
        i,
        j,
        return_addr;

    long rep;

    unsigned long start, *timing_data;

    Tasking task_packet;

    timing_data = (unsigned long*) calloc(cubesize, sizeof(unsigned long));
    for (i = 0; i < cubesize; i++) {

        crecv(ANY_TYPE, &task_packet, sizeof(Tasking));
        msg = (char *) calloc(task_packet.bytes, sizeof(char));
        switch (task_packet.task) {
            case RECEIVE :
                return_addr = task_packet.destination[0];
                for (rep = 0; rep < task_packet.reps; rep++) {
                    crecv(ANY_TYPE, msg, task_packet.bytes);
                    csend(0, msg, task_packet.bytes, return_addr, NODE_PID);
                }
                break;

            case SEND :
                j = 0;
                while ((j < cubesize) && (task_packet.destination[j] < cubesize)) {
                    start = mclock();
                    for (rep = 0; rep < task_packet.reps; rep++) {
                        (j == mynode()) ?
                            csend(0, msg, task_packet.bytes, myhost(), NODE_PID) :
                            csend(0, msg, task_packet.bytes, j, NODE_PID);
                        crecv(ANY_TYPE, msg, task_packet.bytes);
                    }
                    timing_data[j] = mclock() - start;
                    j++;    /* advance to the next addressee */
                }
                /* Return the timing data */
                csend(0, timing_data, (cubesize * sizeof(unsigned long)),
                      myhost(), NODE_PID);
                break;

            default :
                printf("Unrecognized task at node %ld.\n", mynode() );
                exit(EXIT_FAILURE);

        } /* end switch() */

        free(msg);

    } /* end for() */

    return (SUCCESS);

}
/* =========== EOF commtstn.c =========== */
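
The tasking protocol shared by commtst.c and commtstn.c can be tried without a hypercube. The following is a sketch only: the real Tasking type lives in commtst.h, which is not reproduced here, so the field layout is inferred from how the two listings use it. MAX_NODES, the numeric values of RECEIVE, SEND, and ROOT, and the helper names make_receive, make_send, and count_addressees are all assumptions made for illustration.

```c
#define MAX_NODES 8     /* assumption: largest cube here is dimension 3  */
#define ROOT      0     /* assumption: placeholder for the host's entry  */

enum Task { RECEIVE, SEND };    /* assumption: real encodings are in commtst.h */

typedef struct {
    int  task;                      /* SEND or RECEIVE                       */
    long bytes;                     /* message length                        */
    long reps;                      /* repetitions of each round trip        */
    int  destination[MAX_NODES];    /* addressees; a value >= cubesize ends  */
} Tasking;

/* A receiver only needs the sender's address, followed by the end flag. */
void make_receive(Tasking *t, int sender, int cubesize)
{
    t->task = RECEIVE;
    t->destination[0] = sender;
    t->destination[1] = cubesize;   /* impossible node number flags the end */
}

/* The sender gets one entry per node; its own slot stands for the host. */
void make_send(Tasking *t, int sender, int cubesize)
{
    int j;

    t->task = SEND;
    for (j = 0; j < cubesize; j++)
        t->destination[j] = (j == sender) ? ROOT : j;
}

/* Count the entries a node would visit, using the same loop condition
 * as the SEND case in commtstn.c. */
int count_addressees(const Tasking *t, int cubesize)
{
    int j = 0;

    while (j < cubesize && t->destination[j] < cubesize)
        j++;
    return j;
}
```

For a four-node cube with node 2 as the sender, make_send yields four addressees, and the packet built by make_receive stops the same walk after its single return address.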
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APPENDIX  F 
MATRIX  LIBRARY 


This appendix contains part of the matrix library, matlib, that is often used
and referenced in other sections and code.  It could be argued that “matrix library”
is a misnomer since much of the code has little to do with matrices.  This criticism is
true, but I will defend the name since the entire reason for creating such a library
was to handle matrices in a more reasonable way.  The last section of this appendix
contains all of the source code for Gauss factorization with partial pivoting, and a
short excerpt from the complete pivoting code.

The specifications and a portion of the source code for the library are given on
the pages to follow.  The original intent was to include the source code in its entirety,
but this would require more than double the current number of pages, so most of the
source has been omitted.  The files are divided into four logical groups:

1.  Makefiles  that  simplify  maintenance  of  the  library,  show  dependencies  among 
the  files,  and  describe  the  compilation  procedures  that  are  used  to  generate  the 
loadable  (executable)  code. 

2.  Standard files (mostly C header files) that make definitions available (for
consistency) across a wide range of files.  The range is implied by the content of
the file.  These files include manifest constants that are installed using the C
Preprocessor #define directive, type definitions that are intended for use across
several files, and macro definitions that are expanded by the C Preprocessor.

3.  Source code files that appear in pairs, like filename.h and filename.c or (mostly)
as a header file alone.  The header file gives remarks, definitions of manifest
constants, type definitions, and function declarations (specifications) that pertain to
the associated source code (i.e., the code within filename.c).  Again, the latter
has been omitted in most cases.

4.  The Gauss factorization code.  All of the source code for the partial pivoting
version is given, and an excerpt of the pivot selection function from the complete
pivoting code is also provided.
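
The header-and-source pairing in groups 2 and 3 can be illustrated with a small sketch. Nothing below comes from matlib itself: the file name cube_util, the constant MAX_DIM, the type CubeInfo, and the function cube_info are hypothetical, and the body given for POW2 is an assumed form (the real macro is defined in macros.h). The PROTOTYPE switch follows the convention visible in the listings of the previous appendix, where it is supplied on the compiler command line; it is defined inline here only so that the sketch stands alone.

```c
#define PROTOTYPE 1     /* normally supplied by the makefile (e.g., -dPROTOTYPE) */

/* ---- cube_util.h (hypothetical header) ---- */
#ifndef CUBE_UTIL_H
#define CUBE_UTIL_H

#define MAX_DIM  3                  /* manifest constant                   */
#define POW2(n)  (1L << (n))        /* macro; assumed form of the original */

typedef struct {                    /* type shared across several files    */
    int  dim;
    long nodes;
} CubeInfo;

#ifdef PROTOTYPE
CubeInfo cube_info(int dim);        /* ANSI prototype                      */
#else
CubeInfo cube_info();               /* pre-ANSI declaration                */
#endif

#endif /* CUBE_UTIL_H */

/* ---- cube_util.c (the associated source file) ---- */
#ifdef PROTOTYPE
CubeInfo cube_info(int dim)
#else
CubeInfo cube_info(dim)
    int dim;
#endif
{
    CubeInfo c;

    if (dim < 0)       dim = 0;         /* clamp to the supported range */
    if (dim > MAX_DIM) dim = MAX_DIM;
    c.dim   = dim;
    c.nodes = POW2(dim);
    return c;
}
```

Keeping both declaration styles behind one switch lets the same files build with an ANSI compiler and with an older one, which is presumably why both makefiles pass the PROTOTYPE definition to the preprocessor.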

A.  MAKEFILES 

logc.mak  This  makefile  is  a  standard  template  for  programs  compiled  with  the 
Logical  Systems  C  (version  89.1)  product. 

matlib.mak  This  makefile  is  used  to  translate  matlib  into  a  useable  form.  With 
Logical  Systems  C,  it  creates  a  library  suitable  for  installation  and  use  as  any 
other  normal  C  library.  The  portion  of  the  makefile  used  on  the  Intel  iPSC/2 
simply  works  in  the  current  directory  to  translate  the  source  into  object  code  so 
that  other  programs  can  reference  it. 
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logc.mak 


# ----------------------------------------------------------------------
#
# AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
# PURPOSE : Makefile for Hypercube Communications Test Programs (LogC)
# DATE    : 10 August 1991
#
# ----------------------------------------------------------------------

ROOTCODE=filename
NODECODE=filename
NIF_FILE=filename


# ---------------------- OPTIONS AND DEFINITIONS ----------------------
#
# The following section establishes various options and definitions.  We
# start with PP, the Logical Systems C Preprocessor.  The '-dX' option
# (with no macro_expression) is like '#define X 1'.  Next we set up the
# compilation options for Logical Systems' TCX Transputer C Compiler.  The
# '-c' means compress the output file.  The options beginning with '-p'
# tell TCX to generate code for the appropriate processor:
#
#       -p2     T212 or T222
#       -p25    T225
#       -p4     T414
#       -p45    T400 or T425
#       -p8     T800
#       -p85    T801 or T805
#
# Logical Systems' TASM Transputer Assembler is next.  The '-c' means
# compress the output file (it can cut it in half)!  The '-t' is used
# because the input to TASM will be from a language translator (TCX's
# output) and not from assembly source code.
#
# The final list tells TLNK which libraries to look at during linking.
# It also establishes an entry point.  You should always use _main for
# the root node; otherwise use _ns_main (for other nodes).

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main


# ----------------------- DEFAULT ===> MAKE ALL -----------------------
#

all: $(ROOTCODE).tld $(NODECODE).tld


# ----------------------------- ROOT CODE -----------------------------
#

$(ROOTCODE): $(ROOTCODE).tld

$(ROOTCODE).tld: $(ROOTCODE).trl
	echo FLAG c > $(ROOTCODE).lnk
	echo LIST $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY $(RENTRY) >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB) >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl: $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)
	tcx $(ROOTCODE).pp $(TCXOPT4)
	tasm $(ROOTCODE).tal $(TASMOPT)


# ----------------------------- NODE CODE -----------------------------
#

$(NODECODE): $(NODECODE).tld

$(NODECODE).tld: $(NODECODE).trl
	echo FLAG c > $(NODECODE).lnk
	echo LIST $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY $(NENTRY) >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB) >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl: $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)
	tcx $(NODECODE).pp $(TCXOPT8)
	tasm $(NODECODE).tal $(TASMOPT)

#
#

run: $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	ld-net $(NIF_FILE)


# ------------------------------ CLEAN UP ------------------------------
#

clean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl

# EOF logc.mak
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1  « - ==========  MAKEFILE  FOR  MATRIX  LIBRARY  ========== - 

#
# SOURCE  : matlib.mak
# DATE    : 17 August 1991
# AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
#
# PURPOSE : Make the matrix library 'matlib'.
#
# REMARKS : This makefile works with Logical Systems C, version 89.1,
#           and the Intel iPSC/2 compiler.  The LogC portions of this
# makefile actually construct libraries of the functions available in the
# source files indicated.  There are two libraries generated--matlib4.tll
# & matlib8.tll--since the code is compiled for T414 or T800 processors.
# For the Intel compiler, I have not created a library; but have used the
# object code as needed.  There are a few sections that pertain to both
# compilers.  The sections that only pertain to a particular compiler are
# clearly marked 'Intel iPSC/2' or 'Logical Systems C'.
#


# - ========== 1.) DEFINITIONS AND OPTIONS ========== -
#
# The following options and definitions are required.  A more thorough
# explanation can be found in 'logc.mak' or in the Logical Systems C
# Transputer Toolset manual.
#

THISMAKEFILE=matlib.mak


# - =============== 1.1) Intel iPSC/2 =============== -
#

# MATLIBDIR is the directory that contains the matlib files
MATLIBDIR = /usr/hartman/matlib
OBJECTS = clargs.o comm.o hcube.o generate.o mat_ops.o matrixio.o memory.o math.o sep.o timing.o vec_ops.o


# - ============ 1.2) Logical Systems C ============ -
#

T414LIBNAME=matlib4
T800LIBNAME=matlib8

TRL4FILES=clargs.trl4 comm.trl4 complex.trl4 generate.trl4 machine.trl4 mat_ops.trl4 math.trl4 matrixio.trl4 memory.trl4 num_sys.trl4 sep.trl4 timing.trl4 vec_ops.trl4

TRL8FILES=clargs.trl8 comm.trl8 complex.trl8 generate.trl8 machine.trl8 mat_ops.trl8 math.trl8 matrixio.trl8 memory.trl8 num_sys.trl8 sep.trl8 timing.trl8 vec_ops.trl8

TLIB4FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops

TLIB8FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800

TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8

TASMOPT=-ct

T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll

RENTRY=_main
NENTRY=_ns_main


# - ======= 2.) INSTRUCTIONS FOR DEFAULT MAKE ======= -
#
# The following sections give the default (since they appear first in the
# makefile) options for this makefile.  By commenting one or the other
# out, one can get to the defaults easily.
#

ipsc:   imatlib
clean:  iclean
# tptr:   tmatlib
# clean:  tclean


# - =============== 2.1) Intel iPSC/2 =============== -
#

imatlib: $(OBJECTS)


# - ============ 2.2) Logical Systems C ============ -
#
# Make everything and install in the library directory designated by the
# environment variable TLIB.

tmatlib:
	make -f $(THISMAKEFILE) $(T414LIBNAME).tll
	make -f $(THISMAKEFILE) install4
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) $(T800LIBNAME).tll
	make -f $(THISMAKEFILE) install8
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) install_headers


# ---------- CREATE T414 VERSION OF THE LIBRARY ----------

$(T414LIBNAME).tll : $(TRL4FILES)
	tlib $(T414LIBNAME) -b $(TLIB4FILES)

clargs.trl4 : clargs.h clargs.c
	pp clargs.c $(PPOPT4)
	tcx clargs.pp $(TCXOPT4)
	tasm clargs.tal $(TASMOPT)

comm.trl4 : comm.h comm.c
	pp comm.c $(PPOPT4)
	tcx comm.pp $(TCXOPT4)
	tasm comm.tal $(TASMOPT)

complex.trl4 : complex.h complex.c
	pp complex.c $(PPOPT4)
	tcx complex.pp $(TCXOPT4)
	tasm complex.tal $(TASMOPT)

generate.trl4 : generate.h generate.c matrix.h memory.trl4
	pp generate.c $(PPOPT4)
	tcx generate.pp $(TCXOPT4)
	tasm generate.tal $(TASMOPT)

hcube.trl4 : hcube.h hcube.c
	pp hcube.c $(PPOPT4)
	tcx hcube.pp $(TCXOPT4)
	tasm hcube.tal $(TASMOPT)

machine.trl4 : machine.h machine.c
	pp machine.c $(PPOPT4)
	tcx machine.pp $(TCXOPT4)
	tasm machine.tal $(TASMOPT)

mat_ops.trl4 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT4)
	tcx mat_ops.pp $(TCXOPT4)
	tasm mat_ops.tal $(TASMOPT)

math.trl4 : math.h math.c
	pp math.c $(PPOPT4)
	tcx math.pp $(TCXOPT4)
	tasm math.tal $(TASMOPT)

matrixio.trl4 : matrixio.h matrixio.c ascii.h matrix.h memory.trl4
	pp matrixio.c $(PPOPT4)
	tcx matrixio.pp $(TCXOPT4)
	tasm matrixio.tal $(TASMOPT)

memory.trl4 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT4)
	tcx memory.pp $(TCXOPT4)
	tasm memory.tal $(TASMOPT)

num_sys.trl4 : num_sys.h num_sys.c matrix.h
	pp num_sys.c $(PPOPT4)
	tcx num_sys.pp $(TCXOPT4)
	tasm num_sys.tal $(TASMOPT)

sep.trl4 : sep.h sep.c
	pp sep.c $(PPOPT4)
	tcx sep.pp $(TCXOPT4)
	tasm sep.tal $(TASMOPT)

timing.trl4 : timing.h timing.c
	pp timing.c $(PPOPT4)
	tcx timing.pp $(TCXOPT4)
	tasm timing.tal $(TASMOPT)

vec_ops.trl4 : vec_ops.h vec_ops.c
	pp vec_ops.c $(PPOPT4)
	tcx vec_ops.pp $(TCXOPT4)
	tasm vec_ops.tal $(TASMOPT)


# ---------- CREATE T800 VERSION OF THE LIBRARY ----------

$(T800LIBNAME).tll : $(TRL8FILES)
	tlib $(T800LIBNAME) -b $(TLIB8FILES)

clargs.trl8 : clargs.h clargs.c
	pp clargs.c $(PPOPT8)
	tcx clargs.pp $(TCXOPT8)
	tasm clargs.tal $(TASMOPT)

comm.trl8 : comm.h comm.c
	pp comm.c $(PPOPT8)
	tcx comm.pp $(TCXOPT8)
	tasm comm.tal $(TASMOPT)

complex.trl8 : complex.h complex.c
	pp complex.c $(PPOPT8)
	tcx complex.pp $(TCXOPT8)
	tasm complex.tal $(TASMOPT)

generate.trl8 : generate.h generate.c matrix.h memory.trl8
	pp generate.c $(PPOPT8)
	tcx generate.pp $(TCXOPT8)
	tasm generate.tal $(TASMOPT)

hcube.trl8 : hcube.h hcube.c
	pp hcube.c $(PPOPT8)
	tcx hcube.pp $(TCXOPT8)
	tasm hcube.tal $(TASMOPT)

machine.trl8 : machine.h machine.c
	pp machine.c $(PPOPT8)
	tcx machine.pp $(TCXOPT8)
	tasm machine.tal $(TASMOPT)

mat_ops.trl8 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT8)
	tcx mat_ops.pp $(TCXOPT8)
	tasm mat_ops.tal $(TASMOPT)

math.trl8 : math.h math.c
	pp math.c $(PPOPT8)
	tcx math.pp $(TCXOPT8)
	tasm math.tal $(TASMOPT)

matrixio.trl8 : matrixio.h matrixio.c ascii.h matrix.h memory.trl8
	pp matrixio.c $(PPOPT8)
	tcx matrixio.pp $(TCXOPT8)
	tasm matrixio.tal $(TASMOPT)

memory.trl8 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT8)
	tcx memory.pp $(TCXOPT8)
	tasm memory.tal $(TASMOPT)

num_sys.trl8 : num_sys.h num_sys.c matrix.h
	pp num_sys.c $(PPOPT8)
	tcx num_sys.pp $(TCXOPT8)
	tasm num_sys.tal $(TASMOPT)

sep.trl8 : sep.h sep.c
	pp sep.c $(PPOPT8)
	tcx sep.pp $(TCXOPT8)
	tasm sep.tal $(TASMOPT)

timing.trl8 : timing.h timing.c
	pp timing.c $(PPOPT8)
	tcx timing.pp $(TCXOPT8)
	tasm timing.tal $(TASMOPT)

vec_ops.trl8 : vec_ops.c vec_ops.h
	pp vec_ops.c $(PPOPT8)
	tcx vec_ops.pp $(TCXOPT8)
	tasm vec_ops.tal $(TASMOPT)


# ---------- COPY LIBRARIES TO TLIB DIRECTORY ----------

install4:
	copy $(T414LIBNAME).tll $(TLIB)

install8:
	copy $(T800LIBNAME).tll $(TLIB)


# ---------- COPY HEADER FILES TO STANDARD INCLUDE DIRECTORY ----------

install_headers:
	copy ascii.h $(TLIB)\..\include
	copy macros.h $(TLIB)\..\include
	copy matrix.h $(TLIB)\..\include
	copy clargs.h $(TLIB)\..\include
	copy comm.h $(TLIB)\..\include
	copy complex.h $(TLIB)\..\include
	copy generate.h $(TLIB)\..\include
	copy hcube.h $(TLIB)\..\include
	copy machine.h $(TLIB)\..\include
	copy mat_ops.h $(TLIB)\..\include
	copy math.h $(TLIB)\..\include
	copy matrixio.h $(TLIB)\..\include
	copy memory.h $(TLIB)\..\include
	copy num_sys.h $(TLIB)\..\include
	copy sep.h $(TLIB)\..\include
	copy timing.h $(TLIB)\..\include
	copy vec_ops.h $(TLIB)\..\include


# - ======== 3.) FILE MANAGEMENT & UTILITIES ======== -
#
# This section makes short work of a few useful/routine tasks.
#


# - =============== 3.1) Intel iPSC/2 =============== -
#

iclean:
	rm $(OBJECTS)


# - ============ 3.2) Logical Systems C ============ -
#

tclean:
	del *.pp
	del *.tal
	del *.trl

# EOF matlib.mak
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B.  NETWORK  INFORMATION  FILES 

hyprcube.nif  This  Network  Information  File  gives  a  fairly  complete  description  of 
the  hardware  configuration  used  to  perform  the  transputer  work. 
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1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 


; ========== NETWORK INFORMATION FILE ==========
;
; SOURCE    hyprcube.nif
; VERSION   1.1
; DATE      09 September 1991
; AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
; USAGE     ld-net hyprcube
; EDITING   replace 'rootcode' with the code to run on the root
;           replace 'nodecode' with appropriate code(s) for the nodes
;
;
; ========== REFERENCES ==========
;
; [1] Inmos. IMS B012 User Guide and Reference Manual. Inmos Limited,
;     1988, Fig. 26, p. 28.
;
;
; ========== DESCRIPTION ==========
;
; Network Information File (NIF) used by Logical Systems C (version 89.1)
; LD-NET Network Loader.  This file prescribes the loading action to take
; place when the 'ld-net' command is given as in USAGE above.
;
; ========== HARDWARE PREREQUISITES ==========
;
; NOTE:  There are three node numbering systems: the one created by Inmos'
; CHECK program, the Gray code labeling, and the NIF labeling.  Since all
; three will be used on occasion, I will prefix node numbers with a C, G,
; or N to identify which system I am using!
;
; The IMS B004 and IMS B012 must be configured correctly.  The B004's T414
; has link 0 connected to the host PC via a serial-to-parallel converter,
; link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
; [communications manager (not used here)] on the B012, and link 3
; connected to the IMS B012 PipeTail (see [1]).  By the way, link 2 from
; the B004 goes to the ConfigUp slot just under the PipeHead slot
; (this connects it to the T212).  Finally, the B004's Down link must run
; to the B012's Up link.
;
;
; ========== SETTING THE C004 CROSSBAR SWITCHES ==========
;
; Once you have connected the hardware in the fashion mentioned above,
; the system is ready to be transformed to a hypercube.  Three codes by
; Mike Esposito are used here: t2.nif, root.tld, and switch.tld.  I have
; a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
; Mike's code passes instructions to the T212 on the B012, which in turn
; tells the C004's how to connect their switches.  After the code has
; executed, the (very specific) configuration that we are looking for
; will exist.  Specifically, the following (output from CHECK /R) is what
; this process gives us:
;
; check 1.21
;  #  Part rate  Mb    Bt  [  Link0  Link1  Link2  Link3  ]
;  0  T414b-16   0.09  0   [  HOST   1:1    2:1    3:2    ]
;  1  T800c-20   0.80  1   [  4:3    0:1    5:1    6:0    ]
;  2  T2   -17   0.49  1   [  C004   0:2    ...    C004   ]
;  3  T800c-20   0.80  2   [  7:3    8:2    0:3    9:0    ]
;  4  T800c-20   0.76  3   [  9:3    10:2   11:1   1:0    ]
;  5  T800d-20   0.90  1   [  8:3    1:2    10:1   12:0   ]
;  6  T800d-20   0.76  0   [  1:3    12:2   7:1    11:0   ]
;  7  T800d-20   0.76  3   [  13:3   6:2    14:1   3:0    ]
;  8  T800d-20   0.90  2   [  14:3   15:2   3:1    5:0    ]
;  9  T800c-20   0.77  0   [  3:3    13:2   15:1   4:0    ]
; 10  T800d-20   0.90  2   [  16:3   5:2    4:1    15:0   ]
; 11  T800d-20   0.90  1   [  6:3    4:2    16:1   13:0   ]
; 12  T800d-20   0.77  0   [  5:3    16:2   6:1    14:0   ]
; 13  T800d-20   0.77  3   [  11:3   17:2   9:1    7:0    ]
; 14  T800c-20   0.90  1   [  12:3   7:2    17:1   8:0    ]
; 15  T800c-20   0.90  2   [  10:3   9:2    8:1    17:0   ]
; 16  T800c-20   0.76  3   [  17:3   11:2   12:1   10:0   ]
; 17  T800d-20   0.88  2   [  15:3   14:2   13:1   16:0   ]
;
; Here node C0 is the root transputer (on the IMS B004) and node C2 is
; the T212 (on the IMS B012).  The other sixteen nodes are the T800's
; that are used for the work.  A logical interconnection topology is
; described below.
;
;
; ========== TOPOLOGY ==========
;
; The physical interconnection scheme described above is an actual 4-cube
; with one exception.  The root node (C0) is situated BETWEEN nodes C1
; and C3 (which would be connected directly in the usual 4-cube).  This
; gives us two 3-cubes: one whose node labeling is G0xxx and the other,
; whose node labeling is G1xxx (where the xxx represents all permutations
; of 3 bits).  These are the usual three cubes, and they will exist if we
; define the node numbering/labeling correctly.
;
;
; ========== STRATEGY ==========
;
; The node labeling established by the NIF is available via the variable
; _node_number (see <conc.h>) in source code.  Therefore, we would like a
; smart labeling scheme in the NIF file so that programming is easier.
; This, of course, is subject to the restriction that NIF labels begin
; with N1 and so on.


; One such method would be to define a NIF labeling so that the Gray code
; label for a node would be (_node_number - 2).  In fact, this is
; possible and the adjacencies defined below allow us to realize this
; feature.  Below, node N0 is the host PC, node N1 is the root transputer
; (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
; nodes of a 4-cube), and N18 is not used (but it's the T212).
;

host_server cio.exe;                 (default)

;
;        TRANSPUTER    RESET   DESCRIPTION OF LINK CONNECTIONS
;  NODE  LOADABLE      COMES   -------------------------------
;  ID    CODE (.tld)   FROM:   LINK0   LINK1   LINK2   LINK3
;
   1,    rootcode,     r0,     0,      2,      ,       10;     B004
   2,    nodecode,     r1,     4,      1,      3,      6;      B012
   3,    nodecode,     r2,     11,     2,      5,      7;
   4,    nodecode,     r5,     12,     5,      8,      2;
   5,    nodecode,     r3,     9,      3,      4,      13;
   6,    nodecode,     r7,     2,      7,      14,     8;
   7,    nodecode,     r9,     3,      9,      6,      15;
   8,    nodecode,     r4,     6,      4,      9,      16;
   9,    nodecode,     r8,     17,     8,      7,      5;
   10,   nodecode,     r11,    14,     11,     1,      12;
   11,   nodecode,     r13,    15,     13,     10,     3;
   12,   nodecode,     r16,    10,     16,     13,     4;
   13,   nodecode,     r12,    5,      12,     11,     17;
   14,   nodecode,     r6,     16,     6,      15,     10;
   15,   nodecode,     r14,    7,      14,     17,     11;
   16,   nodecode,     r17,    8,      17,     12,     14;
   17,   nodecode,     r15,    13,     15,     16,     9;
;  18,   switch,       s1,     ,       1,      ,       ;       T212
;
; EOF hyprcube.nif
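
The labeling rule in the listing above (Gray code label = NIF ID - 2) can be checked against the adjacency table mechanically. The following is a minimal sketch and not part of the thesis code: the function name nif_neighbors and the constants are hypothetical, and it assumes the hybrid arrangement described in the listing, in which the root (N1) takes the place of the direct G0-G8 link.

```c
#include <assert.h>

#define ROOT_ID    1   /* NIF ID of the root transputer (N1)          */
#define ID_OFFSET  2   /* NIF ID = Gray code label + 2 (see listing)  */
#define CUBE_DIM   4   /* sixteen T800s form a 4-cube                 */

/* Hypothetical helper: fill out[] with the NIF IDs adjacent to nif_id.
 * Two hypercube nodes are adjacent when their labels differ in exactly
 * one bit; the G0-G8 edge is replaced by a connection to the root.
 * out[d] is the neighbor across dimension d, NOT across link d. */
void nif_neighbors(int nif_id, int out[CUBE_DIM])
{
    int g, d;

    assert(nif_id >= 2 && nif_id <= 17);   /* N2..N17 are the T800s */
    g = nif_id - ID_OFFSET;

    for (d = 0; d < CUBE_DIM; d++) {
        int h = g ^ (1 << d);              /* flip bit d of the label */

        if ((g == 0 && h == 8) || (g == 8 && h == 0))
            out[d] = ROOT_ID;              /* root sits between G0 and G8 */
        else
            out[d] = h + ID_OFFSET;
    }
}
```

For N2 this yields 3, 4, 6, and 1, the same set of neighbors that the table lists for that node. The order differs because the table is arranged by link number, which depends on how the C004 switches were set.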
C.  STANDARD  FILES


macros.h  This header file gives several C macros that are used in other programs.
matrix.h  This header file establishes the standard definition of a matrix.


macros.h


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE    macros.h
 * VERSION   1.3
 * DATE      14 September 1991
 * AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 */


#define MAX(x,y)  (((x) > (y)) ? (x) : (y))

#define MIN(x,y)  (((x) > (y)) ? (y) : (x))

#define POW2(n)   ((1) << (n))


/* ========== EOF macros.h ========== */
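
A short usage sketch may help here. It repeats the three macros from macros.h and wraps them in two helpers, cube_size and clamp, which are hypothetical names introduced only for illustration. Because these are function-like macros, each argument may be evaluated more than once, so arguments with side effects (such as i++) are best avoided.

```c
#include <assert.h>

#define MAX(x,y)  (((x) > (y)) ? (x) : (y))
#define MIN(x,y)  (((x) > (y)) ? (y) : (x))
#define POW2(n)   ((1) << (n))

/* Hypothetical helper: number of nodes in a hypercube of order dim. */
int cube_size(int dim)
{
    /* POW2 shifts an int, so stay within the 16-bit minimum width of int */
    assert(dim >= 0 && dim < 15);
    return POW2(dim);
}

/* Hypothetical helper: restrict value to the closed range [lo, hi]. */
int clamp(int value, int lo, int hi)
{
    return MAX(lo, MIN(value, hi));
}
```

With these definitions, cube_size(4) gives the sixteen nodes of a 4-cube, and clamp(20, 0, 15) maps an out-of-range node number to 15.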
matrix.h


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE    matrix.h
 * VERSION   2.0
 * DATE      02 September 1991
 * AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ========== DESCRIPTION ==========
 *
 * A header file for a family of functions designed to work with matrices.
 *
 */


#include "complex.h"             /* for Complex_Type */


/* ========== MANIFEST CONSTANTS ========== */

#define  BASE_TEN           10
#define  CURRENT            1
#ifndef  EXIT_FAILURE
#define  EXIT_FAILURE       1
#endif
#ifndef  EXIT_SUCCESS
#define  EXIT_SUCCESS       0
#endif
#define  FAILURE            1
#define  FALSE              0
#define  LINE_LENGTH        80
#define  MAX_NAME_LENGTH    80
#define  NO                 0
#define  OFF                0
#define  ON                 1
#define  ONE_BYTE           1
#define  ONE_MEMBER         1
#define  PREVIOUS           0
#define  SUCCESS            0
#define  TRUE               1
#define  TYPE_CHAR          0
#define  TYPE_DOUBLE        1
#define  TYPE_FLOAT         2
#define  TYPE_INT           3
#define  YES                1


/* ========== TYPE DEFINITIONS ========== */


typedef struct {

    char           *name;
    int             rows,
                    cols;

    double        **matrix;
} Matrix_Type;                   /* default/standard is type double */


typedef struct {

    char           *name;
    int             rows,
                    cols;
    Complex_Type  **matrix;

} Complex_Matrix_Type;           /* type Complex_Type is in complex.h */


typedef struct {

    char           *name;
    int             rows,
                    cols;

    double        **matrix;
} Double_Matrix_Type;


typedef struct {

    char           *name;
    int             rows,
                    cols;

    float         **matrix;
} Float_Matrix_Type;


typedef struct {

    char           *name;
    int             rows,
                    cols;
    int           **matrix;

} Int_Matrix_Type;


/* ========== EOF matrix.h ========== */
D.  SOURCE  CODE  FILES


There is one header file and one (.c) source code file for each remaining member
of the library, so the filename is given without the suffix.

allocate  Memory allocation and management functions.
clargs    For processing command-line arguments.
comm      Communications functions for the hypercubes.
complex   Complex numbers and operations.
epsilon   Machine precision functions.
generate  Matrix generation functions.
io        Input/output (IO) functions.
mathx     A small extension to the C math library.
num_sys   Various number systems (binary, decimal, hexadecimal).
ops       Matrix and vector operations.
timing    Functions for timing.

Again, however, most of the source code has been omitted and only the header
files remain. The singular exception is complex.c because this source contains an
algorithm referenced earlier in the thesis.



allocate.h


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE    allocate.h
 * VERSION   2.0
 * DATE      09 September 1991
 * AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ========== DESCRIPTION ==========
 *
 * Declarations of functions associated with memory allocation.
 *
 *
 * ========== LIST OF FUNCTIONS ==========
 *
 * cmatalloc()
 * intvecalloc()
 * matalloc()
 *
 */


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function performs the memory allocation for a matrix
 *              structure (of the Complex_Matrix_Type) using the C function
 *              calloc().  Additionally, it fills the "rows" and "cols"
 *              fields of the matrix structure returned with the parameters
 *              passed to the function.  If a structure is returned (see
 *              "RETURNS"), then its "rows" and "cols" fields will be
 *              filled with the correct values.  The structure type is
 *              defined in "matrix.h".
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       calloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int rows    the number of rows in the desired matrix
 *              int cols    the number of columns in the desired matrix
 *
 * RETURNS:     A pointer to the structure if successful; NULL otherwise.
 *              The NULL case includes non-positive rows or cols in
 *              addition to the obvious allocation failure.
 *
 * EXAMPLE:     Complex_Matrix_Type *A;
 *
 *              A = cmatalloc(7, 7);
 *
 */

#ifdef PROTOTYPE

Complex_Matrix_Type *cmatalloc(int rows, int cols);

#else

Complex_Matrix_Type *cmatalloc();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function performs the memory allocation for a vector,
 *              v, of num_elements integer elements.
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       calloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  See PURPOSE.
 *
 * RETURNS:     A pointer to the array if successful and NULL otherwise.
 *
 * EXAMPLE:     int desired_size_of_v = 7,
 *                  *v;
 *
 *              v = intvecalloc(desired_size_of_v);
 *
 */

#ifdef PROTOTYPE

int *intvecalloc(int num_elements);

#else

int *intvecalloc();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function performs the memory allocation for a matrix
 *              structure using the C function calloc().  Additionally, it
 *              fills the "rows" and "cols" fields of the matrix structure
 *              returned with the parameters passed in to the function.
 *              If a structure is returned (see "RETURNS"), then its "rows"
 *              and "cols" fields will be filled with the correct values.
 *              The structure type is defined in "matrix.h".
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       calloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int rows    the number of rows in the desired matrix
 *              int cols    the number of columns in the desired matrix
 *
 * RETURNS:     A pointer to the structure if successful; NULL otherwise.
 *              The NULL case includes non-positive rows or cols in
 *              addition to the obvious allocation failure.
 *
 * EXAMPLE:     Double_Matrix_Type *A = matalloc(7, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *matalloc(int rows, int cols);

#else

Double_Matrix_Type *matalloc();

#endif


/* ========== EOF allocate.h ========== */
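
Since allocate.c is omitted from this appendix, the following is only a sketch of how matalloc() could satisfy the contract documented in allocate.h; it is not the thesis implementation. The contiguous element block and the companion matfree() are assumptions of this sketch (no matfree() is listed in allocate.h), and the structure is repeated from matrix.h so that the example stands alone.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    char    *name;
    int      rows,
             cols;
    double **matrix;
} Double_Matrix_Type;             /* as declared in matrix.h */

/* Sketch of matalloc(): returns NULL for non-positive rows or cols and
 * on allocation failure; otherwise "rows" and "cols" are filled in and
 * every element is zero (calloc() clears the memory). */
Double_Matrix_Type *matalloc(int rows, int cols)
{
    Double_Matrix_Type *A;
    int i;

    if (rows <= 0 || cols <= 0)
        return NULL;

    A = (Double_Matrix_Type *) calloc(1, sizeof *A);
    if (A == NULL)
        return NULL;

    /* vector of row pointers */
    A->matrix = (double **) calloc((size_t) rows, sizeof *A->matrix);
    if (A->matrix == NULL) {
        free(A);
        return NULL;
    }

    /* assumption: one contiguous block holds all rows * cols elements */
    A->matrix[0] = (double *) calloc((size_t) rows * (size_t) cols,
                                     sizeof **A->matrix);
    if (A->matrix[0] == NULL) {
        free(A->matrix);
        free(A);
        return NULL;
    }
    for (i = 1; i < rows; i++)
        A->matrix[i] = A->matrix[0] + (size_t) i * (size_t) cols;

    A->rows = rows;
    A->cols = cols;
    return A;
}

/* Hypothetical companion that releases what matalloc() obtained. */
void matfree(Double_Matrix_Type *A)
{
    if (A != NULL) {
        assert(A->matrix != NULL);   /* matalloc() never leaves this NULL */
        free(A->matrix[0]);
        free(A->matrix);
        free(A);
    }
}
```

Keeping the elements in one block means a whole matrix can be handed to a send() style routine as a single buffer, which may matter on the hypercube; whether the original code does this is not stated in the header.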
clargs.h


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE    clargs.h
 * VERSION   1.5
 * DATE      09 September 1991
 * AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ========== DESCRIPTION ==========
 *
 * This header file gives the declarations to accompany clargs.c.  These
 * files provide a standard (if somewhat limited) way of handling command-
 * line arguments.  The objective is to handle:
 *
 *   1.) Simple boolean arguments like "if -v exists, set verbose = TRUE".
 *       We will call such an argument a 'simple' argument type.  This
 *       type argument can be recognized by the fact that it has no
 *       sub-arguments (the sub-argument count, subargc == 0).
 *
 *   2.) Arguments with sub-arguments to be interpreted as numbers.  We
 *       will call this a 'complex' argument type.  Suppose that we want
 *       to set int dim = 3 when the command line arguments contain
 *       "-d 3".  This case implies several requirements:
 *
 *       a.) First, we must know in advance how many sub-arguments the
 *           argument has; we'll call this subargc (in this case we are
 *           expecting one sub-argument, so the caller would have set
 *           subargc = 1).
 *
 *       b.) Secondly, we must know how to interpret each sub-argument
 *           [i.e., what type is the sub-argument?  Is it a double or long
 *           (float and int can be handled by type casting)?]
 *
 *       We will call this kind of argument a complex argument type.  They
 *       can be recognized as those with subargc > 0.
 *
 * Here is the strategy.  The user makes a list of valid command-line
 * arguments by creating an array of pointers to structures of type
 * Arg_Struct.  We will call this the option list, (Arg_Struct *) optv[].
 * The code assumes that you can do something like this at the top of your
 * source:
 *
 *     #define MAX_NUMBER_OF_ARGS 3
 *
 *     static Arg_Struct *optv[MAX_NUMBER_OF_ARGS];
 *
 * Let (int) optc be the option count (number of options).  Every element
 * in (pointed to by) the option list is a structure of type Arg_Struct
 * defined below.  By using the standard C argc and argv, and by creating
 * and passing optc and optv around, we can manipulate command-line
 * arguments just about however we want.  The next step is to understand
 * the structure.
 *
 *
 * ========== LIST OF FUNCTIONS ==========
 *
 * install_complex_arg()
 * install_simple_arg()
 * interpret_args()
 *
 */


/* ========== MANIFEST CONSTANTS ========== */

#ifndef  EXIT_FAILURE
#define  EXIT_FAILURE
#endif
#ifndef  EXIT_SUCCESS
#define  EXIT_SUCCESS
#endif
#ifndef  FALSE
#define  FALSE
#endif
#ifndef  NULL
#define  NULL
#endif
#ifndef  SUCCESS
#define  SUCCESS
#endif
#ifndef  TRUE
#define  TRUE
#endif


/*
 * The maximum number of characters in an argument name, MAX_ARGLEN, is a
 * relatively arbitrary thing; make it whatever you want.  The DOUBLE
 * and LONG manifest constants are assumed to be used for values of
 * subargi (see the structure below).
 */

#define  MAX_ARGLEN
#define  DOUBLE
#define  LONG


/* ========== DATA STRUCTURES ==========
 *
 * argname[]   The (string) name of a valid argument.  For instance, if
 *             you want the simple argument "-v", then argname[] would be
 *             "-v".  If you have a complex argument that will appear as
 *             "-number 3 4.6 6.7", then argname will be "-number" and you
 *             must use the sub-argument variables below to handle the
 *             integer and two floating-point values.
 *
 * subargc     Consider the "-number" example again.  There are three
 *             sub-arguments (3, 4.6, and 6.7) so the sub-argument count
 *             would be 3.
 *
 * subargi[]   This array tells us how to interpret the sub-arguments.  For
 *             instance, again using the "-number" example above, we would
 *             set subargi[0] = LONG; subargi[1] = DOUBLE; and
 *             subargi[2] = DOUBLE.
 *
 * found       This should be initialized to FALSE.  The function
 *             interpret_args() will set this field TRUE if the argname[]
 *             appears on the command line (in *argv[]).
 *
 * dsa[]       This field is an array of double sub-arguments.
 *
 * lsa[]       This field is an array of long sub-arguments.
 *
 *
 * Consider the "-number" example again.  After argument resolution, we
 * would find that dsa[0] is not defined since subargi[0] == LONG.
 * However, we can use subargi[] to verify that subargi[1] and subargi[2]
 * are DOUBLE.  Knowing this, we can safely presume that the values with
 * CORRESPONDING index in dsa[] should be interpreted as doubles.  That
 * is, dsa[1] will be a double value (4.6) and dsa[2] will also be a
 * double (6.7).  In a similar manner, lsa[0] must be a long (3) and
 * lsa[1] and lsa[2] are not defined.
 */

typedef struct {

    char     argname[MAX_ARGLEN];
    int      subargc,       /* how many sub-arguments expected   */
            *subargi,       /* how to interpret sub-arguments    */

             found;         /* set TRUE if the argument is found */
    double  *dsa;           /* double-valued sub-arguments       */

    long    *lsa;           /* long-valued sub-argument list     */
} Arg_Struct;


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To install a valid complex argument in the option list,
 *              optv[].
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int          index;
 *              Arg_Struct  *optv[];
 *              const char  *argname;
 *              int         *interpret,
 *                           subargc;
 *
 * The first three parameters are exactly like the corresponding ones for
 * install_simple_arg().  Additionally, for complex arguments, we need to
 * pass in instructions concerning how many sub-arguments there are (i.e.,
 * subargc) and how to interpret each.  The array interpret[] should be
 * filled with subargc elements when you call this function.  The elements
 * should only be valid ones (e.g., DOUBLE, LONG).
 *
 */

#ifdef PROTOTYPE

void install_complex_arg(int index, Arg_Struct *optv[],
                         const char *argname, int *interpret,
                         int subargc);
#else

void install_complex_arg();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To install a valid simple argument in the option list,
 *              optv[].
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS:  int          index;
 *              Arg_Struct  *optv[];
 *              const char  *argname;
 *
 * The 'index' gives the location of the option in the option list,
 * optv[].  The function uses this index to install the argname at the
 * proper location in optv[].  For instance, set this variable to zero for
 * the first option in the list.  Normal C indexing convention applies;
 * namely, 0 <= index < MAX_NUMBER_OF_ARGS.  The 'argname' is the string
 * that you want recognized as a valid argument.  For instance, suppose
 * that you want a timing argument to be recognized whenever "-t" appears
 * on the command line.  Then you would supply "-t" in this place.
 *
 */

#ifdef PROTOTYPE

void install_simple_arg(int index, Arg_Struct *optv[],
                        const char *argname);

#else

void install_simple_arg();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     Once the user has defined an appropriate option list,
 *              optv[], with optc options, this function parses the
 * command-line arguments (as given by argc and argv) and fills the
 * *optv[] structures appropriately.  For instance, every valid (exists in
 * optv <=> valid) argument that appears on the command line will result
 * in the corresponding optv structure's 'found' field being set to TRUE.
 * The function also interprets sub-arguments and fills dsa[] and/or lsa[]
 * accordingly.  It assumes that the caller has established the desired
 * argname's, subargc's, and subargi's.
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       printf()
 *              strcmp()
 *              strtod()
 *              strtol()
 *
 * CALLED BY:
 *
 * PARAMETERS:  As described in PURPOSE.
 *
 */


#ifdef PROTOTYPE

void interpret_args(int argc, char **argv, int optc, Arg_Struct **optv);

#else

void interpret_args();

#endif


/* ========== EOF clargs.h ========== */
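
Because clargs.c is not reproduced in this appendix, the sketch below is one possible reading of the interface documented in clargs.h, not the thesis code. Several details are assumptions: the values chosen for MAX_ARGLEN, DOUBLE, and LONG; the decision to allocate dsa[] and lsa[] inside install_complex_arg() (the header lists only strcpy() under CALLS); and the choice to leave 'found' FALSE when too few sub-arguments follow an option.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGLEN 32   /* assumed value */
#define DOUBLE      1   /* assumed value */
#define LONG        2   /* assumed value */
#define FALSE       0
#define TRUE        1

typedef struct {
    char    argname[MAX_ARGLEN];
    int     subargc, *subargi, found;
    double *dsa;
    long   *lsa;
} Arg_Struct;

/* Install a simple (flag-only) argument at optv[index]. */
void install_simple_arg(int index, Arg_Struct *optv[], const char *argname)
{
    assert(strlen(argname) < MAX_ARGLEN);
    strcpy(optv[index]->argname, argname);
    optv[index]->subargc = 0;
    optv[index]->subargi = NULL;
    optv[index]->found   = FALSE;
    optv[index]->dsa     = NULL;
    optv[index]->lsa     = NULL;
}

/* Install an argument that takes subargc sub-arguments.  Assumption:
 * the value arrays are allocated here; interpret[] is not copied, so it
 * must remain valid for as long as optv[index] is in use. */
void install_complex_arg(int index, Arg_Struct *optv[], const char *argname,
                         int *interpret, int subargc)
{
    install_simple_arg(index, optv, argname);
    optv[index]->subargc = subargc;
    optv[index]->subargi = interpret;
    optv[index]->dsa = (double *) calloc((size_t) subargc, sizeof(double));
    optv[index]->lsa = (long *)   calloc((size_t) subargc, sizeof(long));
    assert(optv[index]->dsa != NULL && optv[index]->lsa != NULL);
}

/* Scan argv[1..argc-1], mark each option that appears, and convert its
 * sub-arguments according to subargi[]. */
void interpret_args(int argc, char **argv, int optc, Arg_Struct **optv)
{
    int i, j, k;

    for (i = 1; i < argc; i++) {
        for (j = 0; j < optc; j++) {
            Arg_Struct *o = optv[j];

            if (strcmp(argv[i], o->argname) != 0)
                continue;
            if (i + o->subargc >= argc)
                break;              /* too few sub-arguments follow */
            o->found = TRUE;
            for (k = 0; k < o->subargc; k++) {
                const char *s = argv[i + 1 + k];

                if (o->subargi[k] == DOUBLE)
                    o->dsa[k] = strtod(s, NULL);
                else
                    o->lsa[k] = strtol(s, NULL, 10);
            }
            i += o->subargc;        /* skip the consumed sub-arguments */
            break;
        }
    }
}
```

With the "-number 3 4.6 6.7" example from the header, this leaves lsa[0] holding the long value and dsa[1], dsa[2] holding the two doubles, which matches the indexing convention described in the DATA STRUCTURES comment.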
comm.h


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE    comm.h
 * VERSION   2.5
 * DATE      14 September 1991
 * AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ========== DESCRIPTION ==========
 *
 * This header file gives manifest constants and function specifications
 * for comm.c.  These files contain communication (and related) functions
 * for a normal hypercube topology and a hybrid topology.  Unfortunately
 * the code is a bit busy with #ifdef's, but the purpose of these files is
 * to make hypercubes a little more transparent.  This makes the comm.h
 * and comm.c files a bit hard to read, but you should be able to recoup
 * this loss when it comes time to write a particular application.
 *
 *
 * ========== TOPOLOGIES ==========
 *
 * The functions specified below have been designed to work on three very
 * different machines.  First, the Intel iPSC/2 with a normal hypercube of
 * order 0, 1, 2, or 3 is handled.  A normal hypercube of transputers is
 * next on the list (also order 0, 1, 2, or 3).  Finally, there is a
 * hybrid topology of transputers that is handled.  The normal hypercubes
 * need almost no introduction.  We have a host or root processor/program
 * together with programs running on the nodes.  I will use host and root
 * interchangeably here, although 'host' is properly associated with the
 * Intel machine and 'root' is the more correct/descriptive term when the
 * subject is transputer networks.  The hybrid topology deserves a more
 * careful introduction.
 *
 * The hybrid topology is a network of Inmos transputers (PC host with an
 * IMS B004 board and a T414 linked to sixteen T800 processors on an IMS
 * B012 board) arranged so that the 'root' is situated between nodes zero
 * and eight of a 4-cube.  This means that nodes 0 and 8 are NOT directly
 * connected.  The functions made for this topology compensate for this
 * situation.  Instead of trying to describe each function, I will simply
 * remark that the most natural way to treat this problem is (more-or-
 * less) as two 3-cubes attached to the root.  A more careful description
 * of how each problem is handled may be found in the code for the
 * particular function.
 *
 * In summary, the transputer portions of the code depend upon: (1) a very
 * specific hardware configuration, (2) the appropriate NIF file to
 * support the usual Gray code in a convenient way
 *
 *     [ mynode() == _node_number - 2 ],
 *
 * and (3) a particular link arrangement like the one that can be created
 * by Mike Esposito's t2.nif, root.tld, and switch.tld.
 *
 * DETAILS:  Look for additional details in hyprcube.nif.
 *
 *
 * ========== PREREQUISITES ==========
 *
 * Before using any of the functions involving send() or receive(), the
 * host (or root) program must initialize_hypercube().  For transputer
 * applications, EACH of the NODES must initialize_hypercube() too, and
 * you need to be sure that a hypercube exists in hardware and that your
 * NIF describes a hypercube with the usual Gray code.  You must define
 * the global variables {Channel *ic[], *oc[];} because the code depends
 * upon their existence.  Both of these vectors must be of length
 * (cubesize+1) as described in the preface to initialize_hypercube().
 *
 * The cubesize and dimension that you use with the transputer
 * implementation determine the cube.  Even though you actually have
 * sixteen T800's in the cube, the cubesize and dimension that you use
 * will determine the portion that actually gets used.  Note that both the
 * usual hypercube and the hybrid 4-cube are built upon the same hardware
 * and link setup.  Many of the functions declared below DEPEND upon the
 * proper call to the initialize_hypercube() function.  To avoid
 * difficulty, observe the guidelines given with this function!
 * Additionally, in the transputer case, you will need to make sure that
 * you include <conc.h>.
 *
 *
 * ========== LIST OF FUNCTIONS ==========
 *
 * coalesce()
 * cubecast()
 * cubecast_from()
 * directional_exchange()
 * directional_receive()
 * directional_send()
 * hamming_distance()
 * initialize_hypercube()
 * least_dimension()
 * link_number()
 * linkin()
 * linkout()
 * receive()
 * send()
 * submit()
 *
 */

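
The header lists hamming_distance() without showing its body, so the following is a sketch under the usual definition (the number of bit positions in which two node labels differ) rather than the code from comm.c. The helper are_neighbors() is a hypothetical name added for illustration.

```c
#include <assert.h>

/* Sketch of hamming_distance(): count the bit positions in which the
 * node labels a and b differ. */
int hamming_distance(int a, int b)
{
    unsigned x;
    int count = 0;

    assert(a >= 0 && b >= 0);      /* node labels are non-negative */
    x = (unsigned) (a ^ b);        /* set bits mark the differences */
    while (x != 0) {
        count += (int) (x & 1u);
        x >>= 1;
    }
    return count;
}

/* Hypothetical helper: in a normal hypercube two nodes are directly
 * connected exactly when their labels differ in one bit. */
int are_neighbors(int a, int b)
{
    return hamming_distance(a, b) == 1;
}
```

Note that this test describes the normal hypercubes only. In the hybrid 4-cube, nodes 0 and 8 are at distance one yet are not directly connected, because the root sits between them.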
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101  /* - -  MACROS  *  MAIIFEST  COISTAITS  ==== - */ 

102 

103  *ifd«f  TRAISPUTER 

104 


105 

fdef ine 

myhostO 

-1 

106 

fdefine 

mynodeO 

(.node. 

.number  -  2)  /*  depends  upon  <conc.h> 

*/ 

107 

108  false 

/*  iPSC/2  */ 

109 

110 

fdefine 

ALL.IODES 

-1 

111 

fdefine 

ALL.PIDS 

-1 

112 

fdefine 

AIY.IODE 

0 

/* 

for  receive(from  any  node,  ...  ) 

*/ 

113 

fdefine 

AIY.TYPE 

-1 

/* 

♦/ 

114 

fdefine 

ARBITRARY.TYPE 

0 

/* 

don’t  care 

*/ 

115 

fdefine 

KEEP.TIL.RELCUBE  1 

/* 

for  getcubeO 

*/ 

116 

fdefine 

lODE.PID 

0 

/* 

arbitrary  . . .  don  *  t  care 

*/ 

117 

fifndef 

BULL 

118 

fdefine 

BULL 

0 

119  tendif 

120 

121  #endif 

122 

123 

124  iilndal  FALSE 

125  «d«fizie  FALSE  0 

126  iandil 

127 

128  #iindel  TRUE 

129  tdeline  TRUE  1 

130  fendil 

131 

132 

/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  This function performs the first step in the opposite of
 *            the cubecast() function.  That is, this one is used when
 *  you want to collect information from the nodes in `higher dimensions'
 *  of the hypercube at the current node.  You may want to perform some work
 *  before forwarding this information down to the next lower dimension, so
 *  the submit() function is given separately.
 *
 *  Like the other functions in this file, coalesce() performs a somewhat
 *  different task when executed in the hybrid 4-cube, so first we will
 *  discuss the usual hypercubes.  coalesce() is a null operation when
 *  called from a node in the highest dimension [ if least_dimension(node)
 *  is equal to dim ].  Otherwise it performs the communication to receive
 *  from higher dimensions (i.e., neighbors with larger node numbers).  If
 *  it is called from the host/root, it attempts to receive() from node
 *  zero.
 *
 *  The coalesce() and submit() functions must be balanced properly across
 *  the nodes.  The CALLER must take the necessary steps to be sure that
 *  buf is large enough to hold ((dim - least_dimension(node)) * len)
 *  bytes.  That is, there will be (dim - least_dimension(node)) copies of
 *  the message accumulated at the calling node.
 *
 *  There are several exceptions in the hybrid 4-cube topology.  Since the
 *  root is connected to nodes 0000 and 1000, it must make sure that buf
 *  can hold 2 copies of length len.  Then you should think of nodes 0xxx
 *  as one 3-cube and nodes 1xxx as another (more-or-less separate) 3-cube.
 *  That is, there will be no exchanges in the 1xxx direction between them.
 *  To determine the size of buf at any node, use the following formulae:
 *
 *      (3 - least_dimension(node))     * len,    Nodes 0xxx
 *      (3 - least_dimension(node - 8)) * len,    Nodes 1xxx
 *
 *  CAUTIONS:  If you fail to allocate enough space for buf, you may find
 *             that your program doesn't work.
 *
 *             The transputer implementation depends upon the parameter
 *             'type' being set equal to cubesize.
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  least_dimension()
 *          myhost()            (macro given above)
 *          pow2()              "mathx.h"
 *          receive()
 *
 *  CALLED BY:
 *
 *  EXAMPLE:  Suppose we are `at' node 0 and we want to coalesce() copies
 *            of some object from all of the appropriate nodes.  Let the
 *  object be of size 'len' bytes.  For concreteness, let the topology be a
 *  hypercube of order 3 (i.e., dim == 3).  We would allocate a large enough
 *  buf to hold (dim * len) bytes, since least_dimension(0) == 0.  That is,
 *  node 0 will be receiving from all neighbors whose least_dimension() is
 *  greater [in this case, that is ALL of its neighbors]; namely, 1, 2, and
 *  4.  After the call, we would find the data from node 1 in the first len
 *  bytes of buf; the data from 2 in the middle len bytes of buf; and the
 *  data from 4 in the final len bytes of buf.  The function is treated as
 *  a multiple receive(), in increasing origin order, from the appropriate
 *  neighbors.
 *
 *  PARAMETERS:
 *
 *      int   node    the coalesce()ing (receiving) node
 *      int   dim     the dimension of the hypercube
 *      char  *buf    a pointer to the beginning of the buffer where you want
 *                    the message placed.
 *      long  len     the number of bytes to be received from EACH node in
 *                    the next higher dimension that will be submit()ing.
 *      long  type    the type of the message (iPSC/2 applications only), or
 *                    cubesize in the transputer case.
 *
 */

#ifdef PROTOTYPE

void coalesce(int node, int dim, char *buf, long len, long type);

#else

void coalesce(/* int node, int dim, char *buf, long len, long type */);


#endif 
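
The buffer-size rule for coalesce() in an ordinary (non-hybrid) hypercube can be sketched as follows. This is only an illustration of the documented arithmetic, not the thesis implementation: the names `ld_sketch`, `coalesce_buf_bytes`, and `coalesce_sources` are hypothetical, and the sketch assumes that the origins are the neighbors reached by setting one bit at or above least_dimension(node), which reproduces the node 0 example (origins 1, 2, and 4).

```c
#include <assert.h>

/* Hypothetical stand-in for least_dimension(): the number of
 * significant bits in node (0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3). */
static int ld_sketch(int node)
{
    int d = 0;

    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}

/* Bytes the caller must provide in buf before calling coalesce():
 * one copy of len bytes for each higher dimension. */
long coalesce_buf_bytes(int node, int dim, long len)
{
    int ld = ld_sketch(node);

    assert(node >= 0 && len >= 0 && ld <= dim);
    return (long)(dim - ld) * len;
}

/* Assumed origins, in increasing order: node with one extra bit set
 * at positions least_dimension(node) .. dim-1.  Returns the count. */
int coalesce_sources(int node, int dim, int src[])
{
    int k, n = 0;

    for (k = ld_sketch(node); k < dim; k++)
        src[n++] = node | (1 << k);
    return n;
}
```

For node 0 in a 3-cube this gives three copies (from nodes 1, 2, and 4); for a node in the highest dimension it gives zero bytes, matching the statement that coalesce() is then a null operation.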


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  This function is called from the root/host and all nodes to
 *            execute a broadcast to all p nodes.  The host/root sends to
 *  node zero to start the process off.  Let lg(n) denote log_2(n).  This
 *  function performs the communication in lg(p) steps.  For instance, node
 *  zero receives from the host in what we'll call stage zero.  Then, in
 *  stage 1, node 0 passes the message to node 1.  In stage 2, node 0 sends
 *  the message to node 2 and node 1 sends it to node 3.  In stage three,
 *  nodes 0, 1, 2, and 3 each send the message to nodes 4, 5, 6, and 7
 *  (respectively).
 *
 *  Then, in general, in stage i, the message moves into the ith dimension.
 *  If you prefer, you can think of a pointer starting (after the message
 *  arrives at node 0) at the rightmost bit (LSB) and indicating the
 *  direction for the next transmission.  The pointer moves left until it
 *  reaches the MSB.  This is the final stage of the cubecast().
 *
 *  The hybrid 4-cube is implemented by sending the message from the root
 *  to nodes 0 and 8 first.  Then node 0 performs the usual cubecast for
 *  the nodes that appear in the usual 3-cube.  Node 8 mirrors this action,
 *  filling the other three-cube with labels like 1xxx.
 *
 *  In all cases, buf is filled with an initial receive() from the proper
 *  node, and then it is used in retransmissions to other nodes.  In any
 *  event, buf holds the message after execution.
 *
 *  CAUTION:  The transputer implementation depends upon the parameter
 *            'type' being set equal to cubesize.
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  least_dimension()
 *          min()               (macro from macros.h)
 *          myhost()            (macro from above)
 *          pow2()              "mathx.h"
 *          receive()
 *          send()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:
 *
 *      int   node    the sending node
 *      int   dim     the dimension of the hypercube
 *      char  *buf    a pointer to the head of the message
 *      long  len     the number of bytes to be passed
 *      long  type    the type of the message (iPSC/2 applications only), or
 *                    cubesize in the transputer case.
 *
 */


#ifdef  PROTOTYPE 

void  cubecast(int  node,  int  dim,  char  *buf,  long  len,  long  type); 


#else 


void cubecast(/* int node, int dim, char *buf, long len, long type */);


#endif
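
The stage-by-stage schedule described for cubecast() can be sketched as a small routing function. This is a hedged illustration for the ordinary hypercube only (not the hybrid 4-cube), and `cubecast_dest_sketch` is a hypothetical name: in stage i, every node that already holds the message (node numbers below 2^(i-1)) forwards it across bit i-1.

```c
#include <assert.h>

/* Returns the node to which `node` sends during `stage` (1..dim) of
 * the broadcast, or -1 if `node` does not send in that stage.
 * Stage 0 (host to node zero) is outside this sketch. */
int cubecast_dest_sketch(int node, int dim, int stage)
{
    int bit;

    assert(dim >= 1 && stage >= 1 && stage <= dim);
    assert(node >= 0 && node < (1 << dim));

    bit = 1 << (stage - 1);          /* direction used in this stage */
    return (node < bit) ? (node | bit) : -1;
}
```

With dim == 3 the function reproduces the schedule in the comment: stage 1 sends 0 to 1; stage 2 sends 0 to 2 and 1 to 3; stage 3 sends nodes 0 through 3 to nodes 4 through 7.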


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  This function is similar to cubecast() but more general.
 *            Here we do not assume that the message starts at the host
 *  or at node zero; it may start at any general source node, src.  In fact,
 *  it may NOT be called from the root/host (use cubecast() in that case).
 *
 *  If dim is the order of the hypercube, then src goes through dim stages,
 *  passing the message to its neighbors.  The sequence is defined by an
 *  XOR operation that starts at bit 1 of src and moves up through bit dim.
 *  For instance, suppose src == 5 == 101b in the 3-cube (dim == 3).  Then
 *  src will first send to (101 XOR 001) == node 4, next to (101 XOR 010)
 *  == node 7, and finally to (101 XOR 100) == node 1.  Meanwhile, any time
 *  that a non-source node gets the message, it begins the same process,
 *  but only picks it up at the appropriate stage (the one after the stage
 *  in which it received the message).
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  directional_receive()
 *          directional_send()
 *          free()
 *          least_dimension()
 *          malloc()
 *          pow2()              "mathx.h"
 *          receive()
 *          send()
 *          sizeof()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:
 *
 *      int   src     the source
 *      int   node    the number of the node calling this function
 *      int   dim     the dimension of the hypercube
 *      char  *buf    a pointer to the head of the message
 *      long  len     the number of bytes to be passed
 *
 */


#ifdef  PROTOTYPE 


void cubecast_from(int src, int node, int dim, char *buf, long len);


#else


void cubecast_from();


#endif
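
The XOR sequence that cubecast_from() follows can be sketched by working with the relative address (node XOR src). The function name `cubecast_from_targets` is hypothetical, and the rule that a node resumes at the stage after the one in which it received the message is taken from the comment above. This is a sketch of the routing order only, with no actual communication.

```c
#include <assert.h>

/* Fills dest[] with the nodes to which `node` forwards a message that
 * originated at `src`, in transmission order, and returns how many
 * there are.  dest[] must have room for dim entries. */
int cubecast_from_targets(int src, int node, int dim, int dest[])
{
    int rel, first = 0, k, n = 0;

    assert(dim >= 1);
    assert(src >= 0 && src < (1 << dim));
    assert(node >= 0 && node < (1 << dim));

    /* The highest set bit of (node XOR src) marks the stage in which
     * this node received the message; the source itself has rel == 0
     * and therefore starts at the first stage. */
    rel = node ^ src;
    while (rel != 0) {
        rel >>= 1;
        first++;
    }

    /* Forward across every remaining direction, low bit to high bit. */
    for (k = first; k < dim; k++)
        dest[n++] = node ^ (1 << k);
    return n;
}
```

For the 101b source in the 3-cube this yields 4, 7, and 1, in that order; node 4 (which received in the first stage) then forwards to 6 and 0.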
/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  To perform an exchange along a prescribed direction.  The
 *            direction is given as an integer in {1, 2, 4, 8, ..., 2^dim}.
 *  This is because the direction is really a bit mask for the Gray-coded
 *  node numbers.  For instance, if you perform a directional_exchange()
 *  from node == 3 == 011 in the 3-cube along direction == 4 == 100, this
 *  is the same as performing a coordinated send() and receive()
 *  combination with node (011 XOR 100 == 111 == 7).  Care is taken to make
 *  sure that deadlock does not occur.
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  pow2()              "mathx.h"
 *          receive()
 *          send()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:
 *
 *      int   node        the number of the node calling this function
 *      int   dim         the dimension of the hypercube
 *      int   direction   as described above (1, 2, 4, 8, etc.)
 *      char  *ibuf       a pointer to the head of the incoming message
 *      char  *obuf       a pointer to the head of the outgoing message
 *      long  len         the number of bytes to be passed
 *
 */


#ifdef  PROTOTYPE 

void directional_exchange(int node, int dim, int direction,
                          char *ibuf, char *obuf, long len);


#else


void directional_exchange();


#endif 
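
The partner computation for a directional exchange is a single XOR with the direction mask. The sketch below also shows one conventional way to order blocking send() and receive() calls so that the two partners never both wait to receive first. That ordering rule is an assumption for illustration; the source says only that care is taken to avoid deadlock and does not state how. Both function names are hypothetical.

```c
#include <assert.h>

/* Partner of `node` along `direction`, where direction is a single-bit
 * mask (1, 2, 4, 8, ...). */
int exchange_partner_sketch(int node, int direction)
{
    assert(node >= 0);
    assert(direction > 0 && (direction & (direction - 1)) == 0);
    return node ^ direction;
}

/* Assumed ordering rule: the partner whose bit in `direction` is 0
 * sends first and then receives; the other partner does the reverse.
 * Exactly one of the two partners gets a nonzero answer. */
int sends_first_sketch(int node, int direction)
{
    assert(direction > 0 && (direction & (direction - 1)) == 0);
    return (node & direction) == 0;
}
```

For node 3 and direction 4 the partner is node 7, as in the comment above; under the assumed rule node 3 would send first and node 7 would receive first.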


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  To receive from a prescribed direction.  The direction is
 *            as described in directional_exchange() above.
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  pow2()              "mathx.h"
 *          receive()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:
 *
 *      int   node        the number of the node calling this function
 *      int   dim         the dimension of the hypercube
 *      int   direction   direction to receive from
 *      char  *buf        a pointer to the head of the message
 *      long  len         the number of bytes to be passed
 *
 */


#ifdef  PROTOTYPE 

void directional_receive(int node, int dim, int direction,
                         char *buf, long len);


#else


void directional_receive();


#endif 


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  To send in a prescribed direction.  The direction is as
 *            described in directional_exchange() above.
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  pow2()              "mathx.h"
 *          send()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:
 *
 *      int   node        the number of the node calling this function
 *      int   dim         the dimension of the hypercube
 *      int   direction   direction to send to
 *      char  *buf        a pointer to the head of the message
 *      long  len         the number of bytes to be passed
 *
 */

#ifdef  PROTOTYPE 

void directional_send(int node, int dim, int direction,
                      char *buf, long len);

#else

void directional_send();
#endif


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  To give the Hamming distance between i and j.
 *
 *  INCLUDE:  "comm.h"
 *
 *  CALLS:  sizeof()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:  int i, j   the numbers
 *
 *  RETURNS:  (int) the Hamming distance(i, j).  That is, the number of
 *            ones in the binary exclusive OR (i XOR j).
 *
 */

#ifdef PROTOTYPE

int hamming_distance(int i, int j);

#else

int hamming_distance(/* int i, int j */);

#endif
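
The Hamming distance defined above (the number of ones in i XOR j) can be computed with a short bit-counting loop. The function below is a minimal sketch under the hypothetical name `hamming_distance_sketch`; the thesis implementation is not shown in this listing and may count bits differently (its CALLS entry mentions sizeof(), which suggests a loop over the full word width).

```c
#include <assert.h>

/* Number of bit positions in which i and j differ. */
int hamming_distance_sketch(int i, int j)
{
    unsigned int x;
    int count = 0;

    assert(i >= 0 && j >= 0);        /* node numbers are non-negative */

    x = (unsigned int)i ^ (unsigned int)j;
    while (x != 0) {
        x &= x - 1;                  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Two nodes of a hypercube are adjacent exactly when this distance is 1, for example nodes 12 (1100) and 4 (0100).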


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  The initialize_hypercube() function creates the hypercube
 *            and performs the required setup for communications.  It
 *  must be completed before you expect to communicate.  On the iPSC/2,
 *  ONLY the host code should call this function.  For transputer
 *  implementations every node should call it (in addition to the root
 *  node).  This is prerequisite to most of the other functions in this
 *  file.  The basic requirements for this function are so different
 *  (machine dependent) that there are two versions:  one for the
 *  transputers and one for the iPSC/2 machine.
 *
 *  INCLUDE:  "comm.h"
 *
 *  CALLS:  attachcube()        (Intel iPSC/2 C Library)
 *          calloc()
 *          free()
 *          getcube()           (Intel iPSC/2 C Library)
 *          linkin()
 *          linkout()
 *          load()              (Intel iPSC/2 C Library)
 *          malloc()
 *          printf()
 *          setpid()            (Intel iPSC/2 C Library)
 *          sizeof()
 *          strcpy()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:  In both cases, the desired dimension of the hypercube is
 *               passed in as the first argument.  After this, the functions
 *               are quite different.
 *
 *      (1)  iPSC/2
 *
 *      char *nodecode    A pointer to the filename of the nodecode is
 *                        required so that the function can load the node
 *                        program.
 *
 *      (2)  transputers
 *
 *      Channel *ic[(CUBESIZE + 1)]   This is the incoming channel list.
 *      You must declare it globally.  Let CUBESIZE be the number of
 *      transputers in the hypercube.  Then ic[] is a vector of length
 *      (CUBESIZE + 1).  The indexing is such that (ic[n] == C), where
 *      n is some neighbor and C is the incoming Channel * from n.  For
 *      instance, if node k finds that ic[n] == LINK1IN then node k
 *      knows to receive messages from node n via LINK1IN.  The element
 *      ic[CUBESIZE] holds the channel for the root node (if any).
 *      ic[n] == NULL means that there is no connection to node n.
 *
 *      Channel *oc[(CUBESIZE + 1)]   is the outgoing channel list.  It
 *      is completely analogous to ic[] except that it will hold
 *      LINK0OUT, LINK1OUT, LINK2OUT, or LINK3OUT for the appropriate
 *      node index.  Your only obligation is to define these lists as
 *      globals in the manner shown.  The Channel pointer elements will
 *      be filled in by initialize_hypercube().
 *
 *  RETURNS:  The iPSC/2 version of the function returns a pointer to the
 *            name of the cube.  In the transputer environment, the cube
 *            name has no meaning, so a void function suffices.  For the
 *            transputer environment, the single most important task that
 *            initialize_hypercube() performs is the filling of ic[] and
 *            oc[].  These vectors are used by most of the other
 *            communications functions.
 *
 */

#ifdef TRANSPUTER

void initialize_hypercube(int dim);

#else

char *initialize_hypercube(/* int dim, char *nodecode */);

#endif


/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  This function, called from any node in the hypercube,
 *            returns the dimension of the smallest hypercube containing
 *            that node.
 *
 *  INCLUDE:  "comm.h"
 *
 *  CALLS:  pow2()              "mathx.h"
 *
 *  CALLED BY:
 *
 *  PARAMETERS:  int node       the inquiring node
 *
 *  RETURNS:  For an n-cube containing P == 2^(n) processors, this function
 *            is designed to work for nodes numbered 0 through (P-1).  If
 *  the function is called from the root (host) node, there is no guarantee
 *  as to the returned value.  If it is called by a valid node, it will
 *  return the dimension of the smallest hypercube containing that node
 *  number.  For instance least_dimension(0) == 0, least_dimension(1) == 1,
 *  least_dimension(2) == 2, least_dimension(3) == 2, and least_dimension(8)
 *  == 4.
 *
 */

#ifdef PROTOTYPE

int least_dimension(int node);

#else

int least_dimension(/* int node */);


#endif 
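
The values listed in the RETURNS paragraph are consistent with counting the significant bits of the node number. The following is a minimal sketch under that reading; `least_dimension_sketch` is a hypothetical name, and the thesis version (which calls pow2()) may be written differently.

```c
#include <assert.h>

/* Smallest d such that node < 2^d, i.e. the dimension of the smallest
 * hypercube (numbered 0 .. 2^d - 1) that contains `node`. */
int least_dimension_sketch(int node)
{
    int d = 0;

    assert(node >= 0);          /* the root/host is not a valid argument */

    while (node > 0) {          /* one step per significant bit */
        node >>= 1;
        d++;
    }
    return d;
}
```

Node 0 lives in the 0-cube, node 1 first appears in the 1-cube, nodes 2 and 3 in the 2-cube, and node 8 in the 4-cube.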


/* ---------- ========= FUNCTION DECLARATIONS ========= ----------
 *
 *  PURPOSE:  The receive() and send() functions declared below provide
 *            communication to (from) a buffer pointed to by buf.  The
 *  volume of material to send (receive) is indicated in bytes by the len
 *  argument.  The destination (origin) is given by the first argument,
 *  using a valid node number.  Suppose you have an n-cube established upon
 *  a system with p == (2^n) node processors.  Then you should refer to the
 *  nodes of the hypercube by their node number, which is a Gray coded
 *  value in the range [ 0, (p-1) ].  If you are at the root, of course,
 *  you may not communicate with the root (at least not with these
 *  functions); but if you are at one of the nodes of the hypercube, you may
 *  communicate with the root by using myhost() as the origin (or
 *  destination) of your message.  The macro given above makes myhost()
 *  available on the transputers.
 *
 *  Transputers or iPSC/2?  The type parameter is only used in the implied
 *  sense with the iPSC/2 implementation [ it becomes type or typesel for
 *  csend() or crecv() ].  For transputer implementations, type MUST BE set
 *  equal to the number of nodes in the hypercube (e.g., p in the example
 *  above).  I have called this `cubesize' in most of my references.
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  ChanIn()            (Logical Systems C, version 89.1)
 *          ChanOut()
 *          crecv()             (Intel iPSC/2 C Library)
 *          csend()
 *
 *  CALLED BY:
 *
 * ---------- ============== CAUTION ============== ----------
 *
 *  Make sure type == cubesize in the transputer case (see the note above)!
 *
 */

#ifdef PROTOTYPE

void receive(int origin, char *buf, long len, long type);

void send(int destination, char *buf, long len, long type);

#else

void receive(/* int origin, char *buf, long len, long type */);

void send(/* int destination, char *buf, long len, long type */);

#endif

/* ---------- ========= FUNCTION DECLARATION ========= ----------
 *
 *  PURPOSE:  This function is called from the nodes to submit a message
 *            to the next lower dimension.  If it is called from the host
 *  (root) it has no effect.  When it is called from node zero, the
 *  transmission is directed to the root/host.  When called from any other
 *  node, the information in buf is passed to the proper node in the next
 *  lower dimension.  The lower dimension must have an accepting coalesce()
 *  or other receiving function [ coalesce() and submit() are meant to be
 *  used in a balanced fashion, where each submit() or group of submit()'s
 *  in one dimension is matched by a coalesce() in the next lower
 *  dimension ].
 *
 *  PREREQUISITE:  initialize_hypercube()
 *
 *  INCLUDE:  <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 *  CALLS:  least_dimension()
 *          pow2()              "mathx.h"
 *          send()
 *
 *  CALLED BY:
 *
 *  EXCEPTIONS:  Again, we have the hybrid hypercube in the transputer case
 *               (see many comments above).  The general rule is changed in
 *  this case since node 1 submit()s to the root and not node 0.  This is
 *  the only change.
 *
 *  SPECIFICS:  If you need to determine exactly where a submit() will go,
 *              you can figure it out in the following manner [ with the
 *  obvious EXCEPTIONS (the previous paragraph) ] ....
 *
 *  Suppose you are `at' node i in an n-cube (p processors = 2^n).  You
 *  must submit() information to the (unique) node, j, that satisfies two
 *  requirements:
 *
 *      (1)  hamming_distance(i, j) == 1
 *      (2)  least_dimension(i) == (least_dimension(j) + 1)
 *
 *  So, for instance, consider a 4-cube where i == 12.  It should be fairly
 *  easy to see that j will be node 4.  This is because these two nodes are
 *  adjacent and they are one dimension apart in the cube (i.e., node 4
 *  first appears in a 3-cube and node 12 first appears in a 4-cube).
 *
 *  PARAMETERS:
 *
 *      int   node    the sending node
 *      int   dim     the dimension of the hypercube
 *      char  *buf    a pointer to the head of the message
 *      long  len     the number of bytes to be passed
 *      long  type    the type of the message (iPSC/2 applications only), or
 *                    cubesize in the transputer case.
 *
 */

#ifdef PROTOTYPE

void submit(int node, int dim, char *buf, long len, long type);

#else

void submit(/* int node, int dim, char *buf, long len, long type */);

#endif

/* ---------- ============ EOF comm.h ============ ---------- */
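
The destination of a submit() in an ordinary hypercube can be sketched by clearing the most significant set bit of the sending node's number. This is an assumption chosen because it reproduces both the i == 12, j == 4 example and the coalesce() example in which node 0 receives from nodes 1, 2, and 4; it reads requirement (2) loosely as "j lies in a strictly lower dimension than i", since nodes such as 2, 4, and 8 have no neighbor exactly one dimension lower. The name `submit_destination_sketch` is hypothetical and the hybrid 4-cube exception is not modeled.

```c
#include <assert.h>

/* Returns the node that receives a submit() from `node`, or -1 when
 * node is 0 (node zero submits to the root/host instead). */
int submit_destination_sketch(int node)
{
    int msb = 1;

    assert(node >= 0 && node < (1 << 30));   /* keep the shifts in range */

    if (node == 0)
        return -1;

    while ((msb << 1) <= node)       /* find the highest set bit */
        msb <<= 1;

    return node ^ msb;               /* adjacent node with that bit cleared */
}
```

For example, node 12 (1100) submits to node 4 (0100), node 7 submits to node 3, and nodes 1, 2, and 4 all submit to node 0.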
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complex. h 
/*
 *  AUTHOR    :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ---------- ============== REFERENCES ============== ----------
 *
 *  [1]  Goldberg, David.  "What Every Computer Scientist Should Know About
 *       Floating-Point Arithmetic".  ACM Computing Surveys, Vol. 23,
 *       No. 1, March 1991.
 *
 * ---------- ============== DESCRIPTION ============== ----------
 *
 *  This file contains the definition of Complex_Type and declarations of
 *  functions that perform operations with complex numbers:
 *
 *      cadd()
 *      cdiv()
 *      cmul()
 *      csub()
 *      Im()
 *      Re()
 *
 */

/* ---------- ============ TYPE DEFINITION ============ ---------- */

typedef struct {
    double x,    /* real part      */
           y;    /* imaginary part */
} Complex_Type;


/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  To add two complex numbers, z1 and z2, and place their sum
 *            in the Complex_Type '*sum'.
 *
 *  INCLUDE:  "complex.h"
 *
 *  PARAMETERS:  The parameters give the two operands z1 and z2, and a
 *               pointer to the result, sum.
 *
 *  EXAMPLE:  Complex_Type z1, z2, z3;
 *
 *            cadd(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum);
#else

void cadd();
#endif


/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  To divide two complex numbers, (z1 / z2), and place the
 *            result in the Complex_Type '*quotient'.
 *
 *  ALGORITHM:  The code uses Smith's formula (page 25 of [1]) to perform
 *              the division.
 *
 *  INCLUDE:  "complex.h"
 *
 *  PARAMETERS:  The parameters give the two operands z1 and z2, and a
 *               pointer to the result, quotient.
 *
 *  EXAMPLE:  Complex_Type z1, z2, z3;
 *
 *            cdiv(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient);

#else

void cdiv();

#endif
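
Smith's formula avoids forming the squared magnitude of the divisor by first scaling with the ratio of its smaller component to its larger one, which reduces the risk of overflow and underflow. The following is a self-contained sketch of that technique; `Complex_Sketch`, `abs_sketch`, and `cdiv_sketch` are hypothetical names used so the example compiles on its own, and the sketch makes no claim about how division by zero is handled in the thesis code.

```c
#include <assert.h>

typedef struct {
    double x;    /* real part      */
    double y;    /* imaginary part */
} Complex_Sketch;

/* Absolute value without depending on the math library. */
static double abs_sketch(double v)
{
    return (v < 0.0) ? -v : v;
}

/* Computes *q = z1 / z2 using Smith's formula. */
void cdiv_sketch(Complex_Sketch z1, Complex_Sketch z2, Complex_Sketch *q)
{
    double d, den;

    assert(q != 0);

    if (abs_sketch(z2.y) < abs_sketch(z2.x)) {
        d   = z2.y / z2.x;               /* |d| < 1 */
        den = z2.x + z2.y * d;
        q->x = (z1.x + z1.y * d) / den;
        q->y = (z1.y - z1.x * d) / den;
    } else {
        d   = z2.x / z2.y;               /* |d| <= 1 */
        den = z2.y + z2.x * d;
        q->x = (z1.x * d + z1.y) / den;
        q->y = (z1.y * d - z1.x) / den;
    }
}
```

As a check, (1 + 2i) / (3 + 4i) equals (11 + 2i) / 25, that is 0.44 + 0.08i, and the sketch takes the second branch for this divisor.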


/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  To multiply two complex numbers, z1 and z2, and place their
 *            product in the Complex_Type '*product'.
 *
 *  INCLUDE:  "complex.h"
 *
 *  PARAMETERS:  The parameters give the two operands z1 and z2, and a
 *               pointer to the result, product.
 *
 *  EXAMPLE:  Complex_Type z1, z2, z3;
 *
 *            cmul(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product);

#else

void cmul();

#endif

/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  To place the difference of two complex numbers, (z1 - z2),
 *            into the Complex_Type '*difference'.
 *
 *  INCLUDE:  "complex.h"
 */

/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  This function returns the real part of a complex number, z.
 *
 *  PARAMETERS:  The complex number, z, is passed into Re().
 *
 *  RETURNS:  The real part of z as type double.
 *
 *  EXAMPLE:  x = Re(z);
 *
 */

#ifdef PROTOTYPE

double Re(Complex_Type z);

#else

double Re();

#endif

/* ---------- ============ EOF complex.h ============ ---------- */
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complex. c 


/* ---------- ========== PROGRAM INFORMATION ========== ----------
 *
 *  SOURCE    :  complex.c
 *  VERSION   :  1.6
 *  DATE      :  09 September 1991
 *  AUTHOR    :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *  DETAILS   :  See "complex.h".
 *
 */

#include <stdio.h>
#include <math.h>       /* fabs(), used by cdiv() */
#include "complex.h"


/* ---------- ========= FUNCTION DEFINITION ========= ---------- */

#ifdef PROTOTYPE
void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum)
#else
void cadd(z1, z2, sum)
Complex_Type z1,
             z2,
             *sum;
#endif
{
    sum->x = z1.x + z2.x;
    sum->y = z1.y + z2.y;
}
/* End cadd() ---------- */

/* ---------- ========= FUNCTION DEFINITION ========= ---------- */

#ifdef PROTOTYPE
void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient)
#else
void cdiv(z1, z2, quotient)
Complex_Type z1,
             z2,
             *quotient;
#endif
{
    double d;

    if (fabs(z2.y) < fabs(z2.x)) {
        d = (z2.y / z2.x);
        quotient->x = ((z1.x + z1.y * d)/(z2.x + z2.y * d));
        quotient->y = ((z1.y - z1.x * d)/(z2.x + z2.y * d));
    }
    else {
        d = (z2.x / z2.y);
        quotient->x = (( z1.y + z1.x * d)/(z2.y + z2.x * d));
        quotient->y = ((-z1.x + z1.y * d)/(z2.y + z2.x * d));
    }
}
/* End cdiv() ---------- */


/* ---------- ========= FUNCTION DEFINITION ========= ---------- */

#ifdef PROTOTYPE
void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product)
#else
void cmul(z1, z2, product)
Complex_Type z1,
             z2,
             *product;
#endif
{
    product->x = (z1.x * z2.x - z1.y * z2.y);
    product->y = (z1.x * z2.y + z1.y * z2.x);
}
/* End cmul() ---------- */


/* ---------- ========= FUNCTION DEFINITION ========= ---------- */

#ifdef PROTOTYPE
void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference)
#else
void csub(z1, z2, difference)
Complex_Type z1,
             z2,
             *difference;
#endif
{
    difference->x = z1.x - z2.x;
    difference->y = z1.y - z2.y;
}
/* End csub() ---------- */

/* ---------- ========= FUNCTION DEFINITION ========= ---------- */

#ifdef PROTOTYPE
double Im(Complex_Type z)
#else
double Im(z)
Complex_Type z;
#endif
{
    return(z.y);
}
/* End Im() ---------- */


/* ---------- ========= FUNCTION DEFINITION ========= ---------- */

#ifdef PROTOTYPE
double Re(Complex_Type z)
#else
double Re(z)
Complex_Type z;
#endif
{
    return(z.x);    /* x holds the real part (see Complex_Type) */
}
/* End Re() ---------- */

/* ---------- ============ EOF complex.c ============ ---------- */
/* ---------- ========== PROGRAM INFORMATION ========== ----------
 *
 *  SOURCE    :  epsilon.h
 *  VERSION   :  1.7
 *  DATE      :  09 September 1991
 *  AUTHOR    :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ---------- ============== REFERENCES ============== ----------
 *
 *  [1]  Gragg, William B.  Personal conversations, course notes, and MATLAB
 *       code, 1991.
 *
 * ---------- ============== DESCRIPTION ============== ----------
 *
 *  This file contains declarations of functions that determine the machine
 *  precision for a particular machine.  The definition of epsilon is given
 *  below.
 *
 * ---------- ============ LIST OF FUNCTIONS ============ ----------
 *
 *      epsd()
 *      epsf()
 *
 */

/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  To find the machine precision.  The machine precision, eps,
 *            is defined as the largest number which satisfies:
 *
 *                1.0 + eps == 1.0
 *
 *  This program uses the type "double" which normally means an 8-byte
 *  (64-bit) floating-point number stored in the IEEE 754 double precision
 *  standard representation of [ 1 sign bit ][ 11-bit exponent ][ 52-bit
 *  mantissa/significand ].
 *
 *  INCLUDE:  "epsilon.h"
 *
 *  RETURNS:  The value of epsilon (double).
 *
 */

double epsd();
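
A common way to find a value satisfying this definition is to halve a trial value until adding it to 1.0 no longer changes the result. The sketch below follows that approach; it is not the thesis implementation, and the names `adds_to_one` and `epsd_sketch` are hypothetical. Because the search only visits powers of two, it returns the largest power of two with the stated property. The `volatile` store is there to force the sum to be rounded to type double on machines that keep intermediate results in wider registers.

```c
#include <assert.h>

/* Nonzero when 1.0 + eps rounds to exactly 1.0 in type double. */
static int adds_to_one(double eps)
{
    volatile double sum = 1.0 + eps;    /* force rounding to double */

    return sum == 1.0;
}

/* Halve eps until it is absorbed when added to 1.0. */
double epsd_sketch(void)
{
    double eps = 1.0;

    while (!adds_to_one(eps))
        eps /= 2.0;

    assert(eps > 0.0);                  /* the loop stops well before underflow */
    return eps;
}
```

Doubling the returned value gives the smallest power of two that does change 1.0 when added to it, which is the quantity many numerical libraries call machine epsilon.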


/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  This function is identical to epsd() except that it returns
 *            type float.  Note:  The values returned may be identical,
 *            probably reflecting C arithmetic done in type double
 *            regardless of the ultimate type returned.  Anyway, this
 *            function does everything using type float.
 *
 *  INCLUDE:  "epsilon.h"
 *
 *  RETURNS:  The value of epsilon (float).
 *
 */

float epsf();

/* ---------- ============ EOF epsilon.h ============ ---------- */


generate.h 


/* ---------- ========== PROGRAM INFORMATION ========== ----------
 *
 *  SOURCE    :  generate.h
 *  VERSION   :  1.7
 *  DATE      :  09 September 1991
 *  AUTHOR    :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ---------- ============== REFERENCES ============== ----------
 *
 *  [1]  Gragg, William B.  Personal conversations, course notes, and MATLAB
 *       codes, 1991.
 *
 * ---------- ============== DESCRIPTION ============== ----------
 *
 *  Declarations of matrix and vector generation/initialization functions.
 *
 * ---------- ============ LIST OF FUNCTIONS ============ ----------
 *
 *      hilbert()
 *      identity()
 *      initial_permutation_vector()
 *      mxrand()
 *      wilkinson()
 *      zeros()
 *
 */

/* ---------- ========== FUNCTION DECLARATION ========== ----------
 *
 *  PURPOSE:  This function generates a Hilbert matrix of the specified
 *            size.  The function takes care of memory allocation, so
 *            the caller does not need to do this.  The definition used
 *            for a Hilbert matrix is (for rows and columns numbered from
 *            1) that the element at the (i,j) position has the value
 *            (1/(i + j - 1)).
 *
 *  INCLUDE:  "allocate.h"
 *            "matrix.h"
 *
 *  CALLS:  matalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS:  The parameters tell the size of the desired matrix.
 *
 *  RETURNS:  On success (i.e., no allocation problems), hilbert() returns
 *            the allocated matrix filled with the values as described.
 *            A NULL return value flags an allocation failure.
 *
 *  EXAMPLE:  Double_Matrix_Type *k = hilbert(5, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *hilbert(int rows, int cols);

#else

Double_Matrix_Type *hilbert();

#endif

68 

69 

70 

71 

72 
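The element formula above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the layout of Double_Matrix_Type and the behavior of matalloc() are not shown in this listing, so the sketch uses a flat, row-major array obtained from malloc(), and the name hilbert_flat is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not the thesis code): build a rows-by-cols Hilbert
 * matrix in a flat, row-major array.  Returns NULL if allocation fails;
 * the caller is responsible for free(). */
double *hilbert_flat(int rows, int cols)
{
    double *h;
    int i, j;

    assert(rows > 0 && cols > 0);

    h = malloc((size_t)rows * (size_t)cols * sizeof *h);
    if (h == NULL)
        return NULL;

    /* C indexes from 0, so the 1-based formula 1/(i + j - 1)
     * becomes 1/(i + j + 1) here. */
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            h[i * cols + j] = 1.0 / (double)(i + j + 1);

    return h;
}
```

A caller would test the returned pointer against NULL before use, mirroring the allocation-failure convention described for hilbert().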

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function generates an identity matrix of the specified
 *              size.  The function takes care of memory allocation, so
 *              the caller does not need to do this.
 *
 * INCLUDE:     "allocate.h"
 *              "matrix.h"
 *
 * CALLS:       matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The parameters tell the size of the matrix.
 *
 * RETURNS:     On success (i.e., no allocation problems), identity()
 *              returns the allocated matrix filled with ones on the
 *              diagonal.  A NULL return value flags an allocation failure.
 *
 * EXAMPLE:     Double_Matrix_Type *A = identity(5, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *identity(int rows, int cols);

#else

Double_Matrix_Type *identity();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To initialize a permutation vector, p[].  This function
 *              performs allocation for p[], assuming that it must contain
 *              n integer elements.  Additionally, the function assigns
 *              values p[j] = j for all 0 <= j < n.  If allocation fails, p
 *              will be NULL upon return.
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       intvecalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The size of the vector, n.
 *
 * RETURNS:     (A pointer to) the vector.
 *
 */

#ifdef PROTOTYPE

int *initial_permutation_vector(int n);

#else

int *initial_permutation_vector();

#endif
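As a rough illustration of the behavior described for initial_permutation_vector(), the sketch below allocates n integers and stores p[j] = j. It assumes plain malloc() in place of intvecalloc(), whose definition is not part of this listing; the name identity_permutation is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-in (not the thesis code): allocate an n-element
 * permutation vector and initialize it to the identity permutation.
 * Returns NULL for a nonpositive n or on allocation failure. */
int *identity_permutation(int n)
{
    int *p;
    int j;

    if (n <= 0)
        return NULL;

    p = malloc((size_t)n * sizeof *p);
    if (p == NULL)
        return NULL;

    for (j = 0; j < n; j++)
        p[j] = j;               /* p[j] = j for all 0 <= j < n */

    return p;
}
```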


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function generates a matrix whose elements are pseudo-
 *              random numbers (generated by lcdrand() in mathx.c).
 *
 * INCLUDE:     "allocate.h"
 *              "mathx.h"
 *              "matrix.h"
 *
 * CALLS:       lcdrand()
 *              matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The parameters tell the size of the matrix.
 *
 * RETURNS:     On success (i.e., no allocation problems), mxrand() returns
 *              the allocated matrix filled with the random values.  A NULL
 *              return value flags an allocation failure.
 *
 * EXAMPLE:     Double_Matrix_Type *A = mxrand(5, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *mxrand(int rows, int cols);

#else

Double_Matrix_Type *mxrand();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function generates a Wilkinson matrix of the specified
 *              size.  The function takes care of memory allocation, so
 *              the caller does not need to do this.  The definition used
 *              for a Wilkinson matrix is: ones along the diagonal, ones
 *              along the rightmost column, zeros in the upper right
 *              triangle, and (-1)'s in the lower left triangle.
 *
 *                  [  1              1 ]
 *                  [ -1  1           1 ]
 *                  [ -1 -1  1        1 ]
 *                  [ -1 -1 -1  1     1 ]
 *
 * INCLUDE:     "matrix.h"
 *
 * CALLS:       matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The parameters tell the size of the matrix.
 *
 * RETURNS:     On success (i.e., no allocation problems), wilkinson()
 *              returns the allocated matrix filled with the values as
 *              described.  On (allocation) failure, wilkinson() returns
 *              NULL.
 *
 * EXAMPLE:     Double_Matrix_Type *A = wilkinson(5, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *wilkinson(int rows, int cols);

#else

Double_Matrix_Type *wilkinson();

#endif
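The fill rule in the Wilkinson description can be sketched as follows. This is an illustration only: it assumes a flat, row-major array from malloc() rather than Double_Matrix_Type and matalloc(), the name wilkinson_flat is hypothetical, and giving the rightmost column precedence when rows exceed columns is a choice made here, not something the listing specifies.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not the thesis code): ones on the diagonal and in
 * the rightmost column, -1 below the diagonal, 0 above it.
 * Returns NULL if allocation fails; the caller frees the result. */
double *wilkinson_flat(int rows, int cols)
{
    double *w;
    int i, j;

    assert(rows > 0 && cols > 0);

    w = malloc((size_t)rows * (size_t)cols * sizeof *w);
    if (w == NULL)
        return NULL;

    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            double v;

            if (i == j || j == cols - 1)
                v = 1.0;        /* diagonal or rightmost column */
            else if (i > j)
                v = -1.0;       /* lower left triangle */
            else
                v = 0.0;        /* upper right triangle */

            w[i * cols + j] = v;
        }
    }
    return w;
}
```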


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function generates a matrix of the specified size,
 *              where all of the entries are zero.
 *
 * INCLUDE:     "allocate.h"
 *              "matrix.h"
 *
 * CALLS:       matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The parameters tell the size of the matrix.
 *
 * RETURNS:     On success (i.e., no allocation problems), zeros() returns
 *              the allocated matrix filled with zeros.  On allocation
 *              failure, zeros() returns NULL.
 *
 * EXAMPLE:     Double_Matrix_Type *A = zeros(5, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *zeros(int rows, int cols);

#else

Double_Matrix_Type *zeros();

#endif

/* ========== EOF generate.h ========== */
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51 

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To get a yes or no answer from the user.
 *
 * NOTE:        This function includes the prompt "(y/n)?  " so you do not
 *              have to include this in your query.  There is no space
 *              before, two spaces after, and no newline (i.e., as shown).
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CALLS:       getchar()  <stdio.h>
 *
 * CALLED BY:
 *
 * PARAMETERS:  void.
 *
 * RETURNS:     (int) YES or NO (as defined in matrix.h).
 *
 */

int answer();


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     A function which prompts the user for the pertinent data
 *              about a matrix and fills the structure provided with the
 *              appropriate information.  That is, this function allows the
 *              user to input the values of the elements.
 *
 * PARAMETERS:  A pointer to the structure containing the matrix to be
 *              filled.
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CAUTION:     This function ASSUMES that the "rows" and "cols" fields
 *              have been correctly assigned by something like matalloc()
 *              [see "allocate.h"] and makes no effort to enter a value in
 *              those fields of the matrix structure.
 *
 * CALLS:       ()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The parameters tell the size of the matrix.
 *
 * RETURNS:     The matrix associated with A is operated on during the
 *              execution of the function, and the result is available
 *              upon return.
 *
 * EXAMPLE:     if (!fill_matrix(&A)) ...
 *
 */

#ifdef PROTOTYPE

void fill_matrix(Double_Matrix_Type *A);

#else

void fill_matrix();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     A function which reads data from a file and stores it in
 *              the matrix of A.  This function takes care of matrix
 *              allocation for the caller.
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CAUTION:     This function ASSUMES the file has been stored in the
 *              format described in "matrix.fmt".
 *
 * CALLS:       fgets()
 *              fscanf()
 *              rewind()
 *
 * CALLED BY:
 *
 * PARAMETERS:  The pointer to the matrix structure and the file pointer.
 *
 * RETURNS:     1 on success and 0 on any sort of failure.
 *
 */

#ifdef PROTOTYPE

int fread_matrix(Double_Matrix_Type **A, FILE *fp);

#else

int fread_matrix();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     A function which writes data from A->matrix[][] to a file
 *              pointed to by fp.
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * ASSUMPTION:  The caller has already performed fopen() on fp for the
 *              "w" (write) mode.
 *
 * CALLS:       fprintf()
 *              rewind()
 *
 * CALLED BY:
 *
 * PARAMETERS:  A is a pointer to the structure which contains the matrix.
 *              fp is a FILE pointer.
 *
 * RETURNS:     1 on success and 0 on failure.
 *
 */

#ifdef PROTOTYPE

int fwrite_matrix(Double_Matrix_Type *A, FILE *fp, int width, int aft);

#else

int fwrite_matrix();

#endif

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     Press a key to continue!
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CALLS:       fflush()
 *              getchar()
 *              printf()
 *
 */

void pause();

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function provides a printout of the information stored
 *              in the structure A.
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CALLS:       printf()
 *
 * PARAMETERS:  A is the structure that contains the matrix to be printed.
 *              The width and aft values are described near the top of this
 *              file.  The defaults are defined as manifest constants.
 *
 * EXAMPLE:     Double_Matrix_Type *A = hilbert(7, 5);
 *
 *              printmd(*A, LONG_WIDTH, LONG_AFT);
 *
 */

#ifdef PROTOTYPE

void printmd(Double_Matrix_Type A, int width, int aft);

#else

void printmd();

#endif

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function prints the vector, v, of doubles.
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CALLS:       printf()
 *
 * CALLED BY:
 *
 * PARAMETERS:  v is the vector, size is the number of elements in v[].
 *
 */

#ifdef PROTOTYPE

void printvd(double *v, int size, int width, int aft);

#else

void printvd();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function provides a printout of the integer vector v.
 *
 * INCLUDE:     <stdio.h>
 *              "io.h"
 *
 * CALLS:       printf()
 *
 * CALLED BY:
 *
 * PARAMETERS:  v is a vector of size integers.
 *
 */

#ifdef PROTOTYPE

void printvi(int *v, int size, int width);

#else

void printvi();

#endif

/* ========== EOF io.h ========== */
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mathx.h 


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE   : mathx.h
 * VERSION  : 1.2
 * DATE     : 09 September 1991
 * AUTHOR   : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ========== REFERENCES ==========
 *
 * [1] Knuth, Donald E.  The Art of Computer Programming, Volume 2: Semi-
 *     numerical Algorithms.  Addison-Wesley Publishing Company,
 *     Reading, MA, 1969, pp. 9-24.
 *
 * [2] Sedgewick, Robert.  Algorithms, Second Edition.  Addison-Wesley
 *     Publishing Company, Reading, MA, 1988, pp. 513-514.
 *
 * ========== DESCRIPTION ==========
 *
 * A small extension to the usual C <math.h>.
 *
 * ========== LIST OF FUNCTIONS ==========
 *
 * lcdrand()
 * lclrand()
 * multmod()
 * pow2()
 *
 */

/* ========== MANIFEST CONSTANTS ========== */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE
#endif

#define START    1234567      /* starting value, Xo.  See [1] */
#define MULT     31415821     /* multiplier, a.       See [1] */
#define INCR     1            /* increment, c.        See [1] */
#define SQRTM    10000        /* sqrt(m)                      */
#define MODULUS  100000000    /* modulus, m.          See [1] */

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To calculate a pseudo-random number in the range [0, 1]
 *              using the linear congruential method.  This function is a
 *              very simple application of lclrand().  It merely divides
 *              the value that lclrand() returns by the modulus, and
 *              returns the resulting double value.
 *
 * INCLUDE:     "mathx.h"
 *
 * CALLS:       lclrand()
 *
 * CALLED BY:   mxrand()       "generate.c"
 *
 * PARAMETERS:  The parameters are identical to those for lclrand().
 *
 * RETURNS:     A pseudo-random double value in the range [0.0, 1.0].
 *
 * EXAMPLE:     double d;
 *
 *              d = lcdrand(START, MULT, INCR, SQRTM, MODULUS);
 *
 */

#ifdef PROTOTYPE

double lcdrand(long Xn, long a, long c, long sqrtm, long m);

#else /* iPSC/2 */

double lcdrand(/* long Xn, long a, long c, long sqrtm, long m */);

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To calculate a pseudo-random number of type long in the
 *              range [0, (m-1)], where m is the argument for modulus.  The
 *              algorithm uses the linear congruential method.  This method
 *              is given in great detail in [1].  A shorter, algorithmic
 *              treatment is given in [2].  I have tested the function to
 *              be sure that it produces the ten numbers listed on page 513
 *              of [2].
 *
 * INCLUDE:     "mathx.h"
 *
 * CALLS:       multmod()
 *
 * CALLED BY:   lcdrand()
 *
 * PARAMETERS:  The notation comes from [1] (more-or-less).  Xn is the
 *              starting value.  a is the multiplier.  c is the increment.
 *              sqrtm is the square root of m, which is the modulus.  A
 *              negative value for any of the arguments is impossible and
 *              will invoke the defaults given among the manifest constants
 *              above.  The starting value, Xn, is the exception.  If you
 *              supply a nonnegative value, your value will be accepted as
 *              the starting value.  Else, the starting value BEGINS at the
 *              default START and is changed each time the function is
 *              called (as long as the starting value argument, Xn, is
 *              negative).  That is, Xn HAS MEMORY as long as your program
 *              is running.  The other parameters are determined from call-
 *              to-call.
 *
 * RETURNS:     A pseudo-random long in the range [ 0, (m-1) ], where m is
 *              the modulus argument.
 *
 * EXAMPLE:     This example illustrates the use of the default values:
 *
 *              long l;
 *
 *              l = lclrand(START, MULT, INCR, SQRTM, MODULUS);
 *
 */

#ifdef PROTOTYPE

long lclrand(long Xn, long a, long c, long sqrtm, long m);

#else /* iPSC/2 */

long lclrand(/* long Xn, long a, long c, long sqrtm, long m */);

#endif

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To calculate (a * b) mod m, while trying to avoid over-
 *              flow.  This function is adapted from Sedgewick's 'mult'
 *              function on page 513 of [2].
 *
 * INCLUDE:     "mathx.h"
 *
 * CALLS:
 *
 * CALLED BY:   lclrand()
 *
 * PARAMETERS:  long a, b, m.
 *
 * RETURNS:     long (a * b) mod m.
 *
 */

#ifdef PROTOTYPE

long multmod(long a, long b, long m);

#else

long multmod(/* long a, long b, long m */);

#endif

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To calculate the value of two raised to the (n) power.  This
 *              function [unlike the macro POW2() given in macros.h] will
 *              handle the case where (n == 0).  This function uses left
 *              shifts to achieve the result, so if you ask for too large a
 *              value, the result is not guaranteed.  The value of n is
 *              ASSUMED to be a POSITIVE integer.
 *
 * INCLUDE:     "mathx.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS:  The desired power of two, n.
 *
 * RETURNS:     The function returns the value of 2^(n).
 *
 */

#ifdef PROTOTYPE

long pow2(int n);

#else

long pow2(/* int n */);

#endif

/* ========== EOF mathx.h ========== */
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1  /* - ==========  PROGRAM  IIFORMATIOI  ========== - 

 *
 * SOURCE   : num_sys.h
 * VERSION  : 1.4
 * DATE     : 09 September 1991
 * AUTHOR   : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ========== REFERENCES ==========
 *
 * [1] Goldberg, David.  "What Every Computer Scientist Should Know About
 *     Floating-Point Arithmetic."  ACM Computing Surveys, Vol. 23,
 *     No. 1, March, 1991, pp. 5-48.
 *
 * [2] Hayes, John P.  "Computer Architecture and Organization."  McGraw-
 *     Hill Book Company, New York, Second Edition, 1988, p. 196.
 *
 * ========== DESCRIPTION ==========
 *
 * The "num_sys" group of functions relate to number systems (e.g., binary,
 * decimal, hexadecimal).
 *
 * ========== LIST OF FUNCTIONS ==========
 *
 * binrep()
 * binvec()
 * hexrep()
 * ieeerep()
 *
 */

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To display the binary representation of a number.  Given the
 *              parameters described below, binrep() prints the binary
 *              representation.  For numbers of type double, type float, or
 *              type int, binrep() reverses the order of the bytes from the
 *              machine storage.  This makes them more readily recognizable
 *              as [ SIGN ][ EXPONENT ][ MANTISSA ] for the floating-point
 *              types and orders the bytes in order of decreasing signifi-
 *              cance for the integers.
 *
 * INCLUDE:     "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS:  The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  The
 *              function understands TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT,
 *              and TYPE_INT.  It also needs a pointer to the number.
 *
 * EXAMPLE:     float f;
 *
 *              binrep(TYPE_FLOAT, &f);
 *
 */

#ifdef PROTOTYPE

void binrep(int number_type, void *the_number);

#else

void binrep();

#endif
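One way to obtain the most-significant-bit-first ordering that binrep() is described as printing is to copy the value into an unsigned integer and shift, which sidesteps the machine's byte order entirely. The sketch below does this for a float only. It assumes a 32-bit float, writes into a caller-supplied buffer instead of printing, and uses the hypothetical name float_bits_msb_first; the byte-reversal approach described in the comment above is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the thesis code): write the 32 bits of f into
 * out as '0'/'1' characters, most significant bit first, followed by a
 * terminating '\0'.  Assumes sizeof(float) == 4. */
void float_bits_msb_first(float f, char out[33])
{
    uint32_t u;
    int i;

    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);       /* reinterpret the bytes safely */

    for (i = 0; i < 32; i++)
        out[i] = ((u >> (31 - i)) & 1u) ? '1' : '0';
    out[32] = '\0';
}
```

On an IEEE 754 machine the first character is then the sign, the next eight the exponent, and the remaining 23 the mantissa.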


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To expand the bits of the input into an array of integers.
 *              The array only holds zeros and ones, with each element
 *              representing a bit of the input number.
 *
 * INCLUDE:     "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * CAUTION:     This function returns the bits AS THEY ARE IN THE MACHINE!
 *              Many machines store type double, type float, and type int
 *              so that their bytes are in an order that is the reverse of
 *              what you might expect.  Of course, the bits within a byte
 *              are in the expected (msb...lsb) order.
 *
 * PARAMETERS:  The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  The
 *              function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *              TYPE_INT.  It also asks for a pointer to the number.
 *
 * RETURNS:     A pointer to int.  The function will take care of allocation
 *              for this pointer, and it will fill the array with the bits
 *              of the number.  For indexing purposes, you will probably
 *              need to know how big this vector is.  Multiply the
 *              [sizeof(type you are sending in)] by 8 (bits/byte).  That's
 *              how many elements will be in the returned vector of integer
 *              (bits).  This pointer will be NULL if there was an alloca-
 *              tion problem.
 *
 * EXAMPLE:
 *
 *     float f;      Assume that this takes 4 bytes * 8 bits
 *     int  *v;      To hold the bit vector of f (32 elements)
 *
 *     v = binvec(TYPE_FLOAT, &f);
 *
 */

#ifdef PROTOTYPE

int *binvec(int number_type, void *the_number);

#else

int *binvec();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To display the hexadecimal representation of a number.
 *
 * INCLUDE:     "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS:  The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  The
 *              function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *              TYPE_INT.  It also needs a pointer to the number.
 *
 * EXAMPLE:     float f;
 *
 *              printf("The hexadecimal representation of %f is: ", f);
 *              hexrep(TYPE_FLOAT, &f);
 *
 */

#ifdef PROTOTYPE

void hexrep(int number_type, void *the_number);

#else

void hexrep();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To display binary and IEEE representation of a number.  This
 *              is nearly a tutorial function!  It displays a binary repre-
 *              sentation of the number, and then breaks out the sign,
 *              exponent, and mantissa (or significand).  Some terse trans-
 *              lation tips are also provided.
 *
 * INCLUDE:     "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS:  The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  This
 *              function ONLY recognizes the floating-point types (i.e.,
 *              TYPE_DOUBLE and TYPE_FLOAT).  It also needs a pointer to
 *              the number.
 *
 * EXAMPLE:     float f;
 *
 *              printf("The IEEE 754 representation of %f is: ", f);
 *              ieeerep(TYPE_FLOAT, &f);
 *
 */

#ifdef PROTOTYPE

void ieeerep(int number_type, void *the_number);

#else

void ieeerep();

#endif

/* ========== EOF num_sys.h ========== */
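The sign, exponent, and mantissa breakdown that ieeerep() is described as displaying can be sketched for a single-precision value as below. The struct and function names (Ieee754Fields, split_float) are hypothetical, and the sketch assumes a 32-bit IEEE 754 float; it returns the fields rather than printing them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three IEEE 754 single-precision fields. */
typedef struct {
    unsigned sign;       /* 1 bit                         */
    unsigned exponent;   /* 8 bits, biased by 127         */
    uint32_t fraction;   /* 23 bits, implicit leading 1   */
} Ieee754Fields;

/* Split a float into its sign, biased exponent, and fraction bits. */
Ieee754Fields split_float(float f)
{
    Ieee754Fields out;
    uint32_t u;

    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);

    out.sign     = (unsigned)(u >> 31);
    out.exponent = (unsigned)((u >> 23) & 0xFFu);
    out.fraction = u & 0x7FFFFFu;
    return out;
}
```

For example, -0.75 is -1.5 x 2^-1, so it splits into sign 1, biased exponent 126, and a fraction with only its top bit set.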
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,  ops.h , 


/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE   : ops.h
 * VERSION  : 1.7
 * DATE     : 09 September 1991
 * AUTHOR   : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ========== REFERENCES ==========
 *
 * [1] Golub, Gene H., and Charles F. Van Loan.  Matrix Computations.  The
 *     Johns Hopkins University Press, Baltimore, 1989.
 *
 * ========== DESCRIPTION ==========
 *
 * The functions declared below perform matrix and vector operations.  For
 * the sake of brevity, I will often use simple (MatLab-style) notation in
 * comments.  For instance, x' means x transpose (i.e., a row).  Do not
 * confuse the comment shorthand with what is really happening in the
 * code.  My goal is to get function specifications across clearly and
 * succinctly without excessive concern for implementation.  Here are a
 * few notes.
 *
 * An operation preceded by a "." means "elementwise".  For instance,
 * x .* y means the elementwise vector multiplication of x by y.  That is,
 * the result would be some vector z like:
 *
 *     z' = [ x[1]*y[1], x[2]*y[2], ..., x[n]*y[n] ]
 *
 * If the operation appears without the preceding ".", it means the vector
 * operation.
 *
 * ========== LIST OF FUNCTIONS ==========
 *
 * cols()
 * dot_product()
 * matrix_product()
 * max_element()
 * normp()
 * outer_product()
 * rows()
 * swap_cols()
 * swap_rows()
 * vec_init()
 *
 */

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To return the number of columns in the matrix A.
 *
 * INCLUDE:     "ops.h"
 *
 */

#ifdef PROTOTYPE

int cols(Double_Matrix_Type *A);

#else

int cols(/* Double_Matrix_Type *A */);

#endif


69 

70 

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     Computes the dot product of the input vectors x and y which
 *              is defined in [1] (page 4).  The dot product of x and y is
 *              x' * y.
 *
 * PARAMETERS:  The vectors x and y should be arrays of type double, each
 *              having "size" elements.
 *
 * INCLUDE:     "ops.h"
 *
 * CALLS:       N/A
 *
 * CALLED BY:   matrix_product()   [see below]
 *
 * RETURNS:     A double (scalar) value equal to the dot product x' * y.
 *
 * EXAMPLE:     The following example would conclude with answer == 10.0.
 *
 *              double answer;
 *
 *              static double x[] = { 1.0, 2.0, 3.0 },
 *                            y[] = { 3.0, 2.0, 1.0 };
 *
 *              int size = 3;
 *
 *              answer = dot_product(x, y, size);
 *
 */

#ifdef PROTOTYPE

double dot_product(double *x, double *y, int size);

#else

double dot_product(/* double *x, double *y, int size */);

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To multiply matrices A and B, placing the product in C.
 *
 * INCLUDE:     "ops.h"
 *
 * CALLS:       dot_product()   [see above]
 *
 * CALLED BY:
 *
 * PARAMETERS:  The parameters tell the size of the matrix.
 *
 * RETURNS:     SUCCESS if the matrices were compatible for multiplication
 *              and C contained enough space to contain the entire result.
 *              FAILURE if A and B were incompatible or C was not big
 *              enough to hold the product.  The values for SUCCESS and
 *              FAILURE are given in "matrix.h".
 *
 * EXAMPLE:     Double_Matrix_Type *A,
 *                                 *B,
 *                                 *C;
 *
 *              if (matrix_product(A,B,C) == FAILURE) {
 *                  printf("matrix_product(A,B,C) failed.\n");
 *                  exit(EXIT_FAILURE);
 *              }
 *              else {
 *                  printf("C contains A * B.\n");
 *              }
 *
 */

#ifdef PROTOTYPE

int matrix_product(Double_Matrix_Type *A,
                   Double_Matrix_Type *B,
                   Double_Matrix_Type *C);

#else

int matrix_product();

#endif
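The compatibility and capacity checks described for matrix_product() might look like the following. This is a sketch under assumptions: SketchMatrix is a hypothetical stand-in for Double_Matrix_Type (whose fields are not shown in this listing), storage is flat and row-major, and SKETCH_SUCCESS and SKETCH_FAILURE are placeholder values rather than the constants in matrix.h. The inner sum is written inline instead of calling dot_product().

```c
#include <assert.h>

#define SKETCH_SUCCESS 1   /* placeholder, not the matrix.h value */
#define SKETCH_FAILURE 0   /* placeholder, not the matrix.h value */

/* Hypothetical matrix type: rows x cols values in row-major order. */
typedef struct {
    int     rows;
    int     cols;
    double *data;
} SketchMatrix;

/* C = A * B.  Fails if the inner dimensions disagree or C is too small. */
int matrix_product_sketch(const SketchMatrix *A, const SketchMatrix *B,
                          SketchMatrix *C)
{
    int i, j, k;

    assert(A != 0 && B != 0 && C != 0);

    if (A->cols != B->rows || C->rows < A->rows || C->cols < B->cols)
        return SKETCH_FAILURE;

    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            double sum = 0.0;

            /* Row i of A dotted with column j of B. */
            for (k = 0; k < A->cols; k++)
                sum += A->data[i * A->cols + k] * B->data[k * B->cols + j];

            C->data[i * C->cols + j] = sum;
        }
    }
    return SKETCH_SUCCESS;
}
```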


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To search the elements below and to the right of A(k,k) for
 *              the element that is maximum in absolute value.
 *
 * INCLUDE:     <math.h>   [link using -lm if necessary]
 *              "ops.h"
 *
 * CALLS:       fabs()
 *
 * CALLED BY:
 *
 * PARAMETERS:  A is the matrix (structure).  k is the index for a position
 *              on the main diagonal, A(k,k).  The search will be conducted
 *              for the area of the matrix that lies below k and to its
 *              right:
 *
 *                  (k,k) ----->
 *                    |    This is the area that will be searched
 *                    |    for an element of maximum absolute value.
 *                    |    The search does NOT include row k nor
 *                    v    does it include column k.
 *
 *              Parameters must also include s, the address of an integer
 *              that will contain the row number for the maximum element
 *              upon return; and t, an address of an integer to store the
 *              column number for the maximum element.
 *
 * NOTE:        To search the WHOLE MATRIX, the parameter k should be (-1).
 *              The values of k, s, and t should be interpreted as the C
 *              versions of indexes (i.e., beginning with 0).
 *
 * RETURNS:     The function returns the maximum (in absolute value)
 *              element found in A (type double).  Additionally, the index
 *              values for this element are placed in the variables pointed
 *              to by s (row) and t (col).
 *
 * EXAMPLE:
 *
 *              Double_Matrix_Type *A;
 *
 *              double u;
 *
 *              int k,
 *                  s,
 *                  t;
 *
 *              u = max_element(A, k, &s, &t);
 *
 */

#ifdef PROTOTYPE

double max_element(Double_Matrix_Type *A, int k, int *s, int *t);

#else

double max_element();

#endif
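The search region described above (rows and columns strictly greater than k, with k = -1 covering the whole matrix) can be sketched as follows. This is an illustration, not the thesis code: it takes a flat, row-major array with explicit dimensions instead of Double_Matrix_Type, uses the hypothetical name max_abs_below_right, and returns the absolute value of the winning element. Setting *s and *t to -1 for an empty region is a choice made here.

```c
#include <assert.h>

/* Hypothetical helper: find the entry of largest absolute value among
 * a[i][j] with i > k and j > k.  Stores its row in *s and column in *t
 * (0-based) and returns its absolute value.  If the region is empty,
 * returns 0.0 and sets *s = *t = -1. */
double max_abs_below_right(const double *a, int rows, int cols, int k,
                           int *s, int *t)
{
    double best = 0.0;
    int i, j;

    assert(a != 0 && s != 0 && t != 0);
    assert(k >= -1);

    *s = -1;
    *t = -1;

    for (i = k + 1; i < rows; i++) {
        for (j = k + 1; j < cols; j++) {
            double v = a[i * cols + j];

            if (v < 0.0)
                v = -v;                 /* absolute value */

            if (*s < 0 || v > best) {   /* first element or new maximum */
                best = v;
                *s = i;
                *t = j;
            }
        }
    }
    return best;
}
```

Pivot-selection code would typically follow such a call with swap_rows() and swap_cols() on the reported indexes.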


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     Computes the p-norm of the input vector x defined in [1]
 *              (page 63).
 *
 * INCLUDE:     <math.h>
 *              "ops.h"
 *
 * CALLS:       fabs()
 *
 * CALLED BY:
 *
 * PARAMETERS:  x is the vector.  It must contain "size" elements of type
 *              double.  The p argument is the p of p-norm.
 *
 * RETURNS:     A double (scalar) value equal to the p-norm of x.
 *
 * EXAMPLE:
 *
 *              static double x[] = { 1.0, 2.0, 3.0 };
 *              double Euclidean_norm_of_x;
 *
 *              Euclidean_norm_of_x = normp(x, 2, 3);
 *
 */

#ifdef PROTOTYPE

double normp(double *x, int p, int size);

#else

double normp();

#endif

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To place the outer product of x and y in C.
 *
 * INCLUDE:     "ops.h"
 *
 * CALLS:       N/A
 *
 * CALLED BY:   N/A
 *
 * ASSUMPTION:  The matrix associated with C is already allocated to the
 *              proper size.
 *
 * PARAMETERS:  The vectors, x and y, of sizes x_size and y_size; and the
 *              matrix associated with C to accept the outer product.
 *
 * RETURNS:     The matrix associated with C is filled with the proper
 *              values.
 *
 */

#ifdef PROTOTYPE

void outer_product(double *x, int x_size, double *y, int y_size,
                   double **C);

#else

void outer_product();

#endif

/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To return the number of rows in the matrix A.
 *
 * INCLUDE:     "ops.h"
 *
 */

#ifdef PROTOTYPE

int rows(Double_Matrix_Type *A);

#else

int rows();

#endif


319 

320 

321 

322  /* . . ==========  FUUCTIOli  DECLARATIOI  ========== . . 

323  ♦ 

324  *  PURPOSE;  To  seap  columns  p  and  q  in  the  matrix  contained  within  A. 

325  • 

326  •  IICLUDE:  "ops.h" 

327  * 

326  •  CALLS:  H/A 

329  ♦ 

330  *  CALLED  BY; 

331  ♦ 

332  *  PARAMETERS;  A  is  the  structure  holding  the  matrix.  The  integers  p  and 

333  *  q  are  the  colusm  numbers  to  be  swapped.  Indexes  are 

334  *  numbered  according  to  the  C  convention  (beginning  at  zero) . 

335  ♦ 

336  *  RETURIS;  Upon  return,  the  columns  have  been  swapped  in  A. 

337  * 

339  */ 

340 


341  #ifdef  PROTOTYPE 


342 

343  void  swap_cols(Double_Natrix_Type  *k,  int  p,  int  q) ; 

344 

345  #else 


346 

347  void  swap.colsO; 
346 


349  Aendif 


350 
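The definition of Double_Matrix_Type is not part of this listing. Purely as an illustration, the sketch below assumes a structure that holds the dimensions and an array of row pointers; the type Matrix_Sketch, its field names, and both function names are hypothetical. It shows why a column swap in row-major storage has to touch every row.

```c
#include <stddef.h>

/* Hypothetical matrix structure (NOT the thesis's Double_Matrix_Type). */
typedef struct {
    int      rows;   /* number of rows                  */
    int      cols;   /* number of columns               */
    double **a;      /* a[i][j], one pointer per row    */
} Matrix_Sketch;

/* Stand-in for rows(): report the row count (0 for a NULL matrix). */
int rows_sketch(const Matrix_Sketch *A)
{
    return (A == NULL) ? 0 : A->rows;
}

/* Stand-in for swap_cols(): exchange columns p and q (zero-based).
 * Out-of-range indexes are ignored; that policy is an assumption. */
void swap_cols_sketch(Matrix_Sketch *A, int p, int q)
{
    int i;
    double tmp;

    if (A == NULL || p == q || p < 0 || q < 0 ||
        p >= A->cols || q >= A->cols)
        return;

    for (i = 0; i < A->rows; i++) {
        tmp = A->a[i][p];
        A->a[i][p] = A->a[i][q];
        A->a[i][q] = tmp;
    }
}
```

A row swap in the same layout would only exchange two row pointers, which is one reason a code that physically swaps rows can stay cheap.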
/* ========= FUNCTION DECLARATION =========
 *
 *  PURPOSE:     To initialize the vector v of n integers with the values
 *               1, 2, 3, ..., n.
 *
 *  INCLUDE:     "ops.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  ASSUMPTION:  The vector, v, has already been successfully allocated as
 *               an array of n integers.
 *
 *  PARAMETERS:  The vector, v, to be initialized; and its size, n.
 *
 *  RETURNS:     The vector's elements are set to the new values and these
 *               values are in v[] upon return.
 *
 * ========================================
 */

#ifdef PROTOTYPE

void vec_init(int *v, int n);

#else

void vec_init();

#endif

/* ============== EOF ops.h ============== */
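The behavior documented for vec_init() is simple enough to sketch directly: element i of the vector receives the value i + 1. The name vec_init_sketch is hypothetical, and the NULL check is an added assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for vec_init(): sets v[0..n-1] to 1, 2, ..., n.
 * The vector must already be allocated with room for n integers. */
void vec_init_sketch(int *v, int n)
{
    int i;

    if (v == NULL)
        return;

    for (i = 0; i < n; i++)
        v[i] = i + 1;
}
```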
PROGRAM INFORMATION

SOURCE   :  timing.h
VERSION  :  1.2
DATE     :  09 September 1991
AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School


REFERENCES

[1] Inmos.  The Transputer Databook.  Second Edition, 1989.
[2] Intel.  iPSC/2 Programmer's Reference Manual.


DESCRIPTION

This file contains definitions of manifest constants, type definitions,
and function declarations for time-related tasks on the Intel iPSC/2 or
a network of Inmos transputers.


LIST OF FUNCTIONS

clock()
delay()


MANIFEST CONSTANTS


#ifdef TRANSPUTER

#define LO_PERIOD   64.0e-6   /* period of low priority clock       */
#define HI_PERIOD   1.0e-6    /* period of high priority clock      */
#define LO_FREQ     15625.0   /* frequency of low priority clock    */
#define HI_FREQ     1.0e6     /* frequency of high priority clock   */

#else /* iPSC/2 */

#define M_PERIOD    1.0e-3    /* period of Intel's mclock()         */
#define M_FREQ      1.0e3     /* frequency for Intel's mclock()     */

#endif
/* ============ TYPE DEFINITIONS ============
 *
 *  The type 'ticks' is defined in an effort to make timing a bit more
 *  transparent across the machines listed.
 *
 */

#ifdef TRANSPUTER
typedef int ticks;
#else /* iPSC/2 */
typedef unsigned long ticks;
#endif


/* ========= FUNCTION DECLARATION =========
 *
 *  PURPOSE:     To get the time (in ticks) from the processor's clock.
 *
 *  INCLUDE:     <conc.h>     (Logical Systems C, version 89.1)
 *               "timing.h"
 *
 *  CALLS:       Time()       (Logical Systems C, version 89.1)
 *               mclock()     (Intel iPSC/2 C)
 *
 *  CALLED BY:
 *
 *  PARAMETERS:
 *
 *  RETURNS:     The function samples the clock and returns ticks.  More
 *               information on ticks, period, and frequency is given in the
 *               definitions above.
 *
 *  EXAMPLE:     ticks t[2];
 *
 *               t[0] = clock();
 *
 */

#ifdef PROTOTYPE

ticks clock(void);

#else

ticks clock(/* void */);

#endif

/* ========= FUNCTION DECLARATION =========
 *
 *  PURPOSE:     To force a delay of at least a given amount (in seconds) in
 *               program execution.
 *
 *  INCLUDE:     <conc.h>            (Logical Systems C, version 89.1)
 *               "timing.h"
 *
 *  CALLS:       ProcGetPriority()   (Logical Systems C, version 89.1)
 *               Time()              (Logical Systems C, version 89.1)
 *               mclock()            (Intel iPSC/2 C)
 *
 *  CALLED BY:
 *
 *  PARAMETERS:  The (float) argument tells the function the minimum time
 *               (in seconds) to delay.
 *
 *  EXAMPLE:     delay(1.25);
 *
 */

#ifdef PROTOTYPE

void delay(float seconds);

#else

void delay( /* float seconds */ );

#endif

/* ============= EOF timing.h ============= */


E.  GAUSS  FACTORIZATION  CODE 


The  Gauss  factorization  code  appears  on  the  pages  that  follow.  First,  the  code 
for  partial  pivoting  is  given.  Since  the  complete  pivoting  case  was  very  similar,  most 
of  it  has  been  omitted  to  save  space.  The  pivot  election  function,  however,  is  shown 
in  a  fragment  of  gfpcnode.c,  the  node  code  for  GF  with  Pivoting  (Complete). 
gfpp.mak


# ----------------------------------------------------------------------
#
#  PURPOSE :  Makefile for Hypercube Gauss Factorization (GF) Program
#  AUTHOR  :  Jonathan E. Hartman, U. S. Naval Postgraduate School
#  DATE    :  26 August 1991
#
# ----------------------------------------------------------------------

ROOTCODE=gfpphost
NODECODE=gfppnode
HEADER=gf
NIF_FILE=gfpp


# ----------  OPTIONS AND DEFINITIONS  ----------
#
#  iPSC/2 Section (MDIR == MatLib directory)

MDIR=/usr/hartman/matlib/


#  Transputer Section
#
#  The following section establishes options and definitions, starting
#  with PP, the Logical Systems C Preprocessor.  The '-dX' option (with no
#  macro_expression) is like '#define X 1'.  Next the compilation options
#  for Logical Systems' TCX Transputer C Compiler are given.  The '-c'
#  means compress the output file.  The options beginning with '-p' tell
#  TCX to generate code for the appropriate processor:
#
#       -p2     T212 or T222
#       -p25    T225
#       -p4     T414
#       -p45    T400 or T425
#       -p8     T800
#       -p85    T801 or T805
#
#  Logical Systems' TASM Transputer Assembler is next.  The '-c' means
#  compress the output file (it can cut it in half)!  The '-t' is used
#  because the input to TASM will be from a language translator (TCX's
#  output) and not from assembly source code.
#
#  The final list tells TLNK which libraries to look at during linking.
#  It also establishes an entry point.  We use '_main' for the root node
#  and '_ns_main' for other nodes.

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4lib.tll
T8LIB=matlib8.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main



# ----------  DEFAULT ===> MAKE ALL  ----------
#
#  Comment out one or the other
#
# all:          ipsc
# run:          irun
# clean:        iclean
all:            transputer
run:            trun
clean:          tclean



# ----------  ROOT CODE  ----------
#
#  iPSC/2 Section

ipsc:           $(ROOTCODE) $(NODECODE)

$(ROOTCODE):    $(ROOTCODE).o
	cc $(ROOTCODE).o $(MDIR)allocate.o $(MDIR)clargs.o $(MDIR)commhost.o $(MDIR)generate.o \
	   $(MDIR)epsilon.o $(MDIR)io.o $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -lm -host \
	   -o $(ROOTCODE)

$(ROOTCODE).o:  $(ROOTCODE).c $(HEADER).h


#  Transputer Section

transputer:     $(ROOTCODE).tld $(NODECODE).tld

$(ROOTCODE).tld:        $(ROOTCODE).trl
	echo FLAG c > $(ROOTCODE).lnk
	echo LIST $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY $(RENTRY) >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB) >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl:        $(ROOTCODE).tal
	tasm $(ROOTCODE).tal $(TASMOPT)

$(ROOTCODE).tal:        $(ROOTCODE).pp
	tcx $(ROOTCODE).pp $(TCXOPT4)

$(ROOTCODE).pp:         $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)


# ----------  NODE CODE  ----------
#
#  iPSC/2 Section

$(NODECODE):    $(NODECODE).o
	cc $(NODECODE).o $(MDIR)allocate.o $(MDIR)commnode.o $(MDIR)generate.o $(MDIR)io.o \
	   $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -node -lm -o $(NODECODE)

$(NODECODE).o:  $(NODECODE).c $(HEADER).h


#  Transputer Section

$(NODECODE).tld:        $(NODECODE).trl
	echo FLAG c > $(NODECODE).lnk
	echo LIST $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY $(NENTRY) >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB) >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl:        $(NODECODE).tal
	tasm $(NODECODE).tal $(TASMOPT)

$(NODECODE).tal:        $(NODECODE).pp
	tcx $(NODECODE).pp $(TCXOPT8)

$(NODECODE).pp:         $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)


# ----------  EXECUTION  ----------
#

irun:           $(ROOTCODE) $(NODECODE)
	$(ROOTCODE)

trun:           $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	echo makecube first
	ld-net $(NIF_FILE) -t -v


# ----------  CLEAN UP  ----------
#

iclean:
	rm $(NODECODE).o
	rm $(ROOTCODE).o
	rm $(NODECODE)
	rm $(ROOTCODE)

tclean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl


# ----------  EOF gfpp.mak  ----------
;  ========  NETWORK INFORMATION FILE  ========
;
;  SOURCE   :  gfpp.nif
;  VERSION  :  1.0
;  DATE     :  14 September 1991
;  AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
;  USAGE    :  ld-net gfpp
;
;
;  ========  REFERENCES  ========
;
;  [1] Inmos.  IMS B012 User Guide and Reference Manual.  Inmos Limited,
;      1988, Fig. 26, p. 28.
;
;
;  ========  DESCRIPTION  ========
;
;  Network Information File (NIF) used by Logical Systems C (version 89.1)
;  LD-NET Network Loader.  This file prescribes the loading action to take
;  place when the 'ld-net' command is given as in USAGE above.
;
;  ========  HARDWARE PREREQUISITES  ========
;
;  NOTE:  There are three node numbering systems: the one created by Inmos'
;  CHECK program, the Gray code labeling, and the NIF labeling.  Since all
;  three will be used on occasion, I will prefix node numbers with a C, G,
;  or N to identify which system I am using!
;
;  The IMS B004 and IMS B012 must be configured correctly.  The B004's T414
;  has link 0 connected to the host PC via a serial-to-parallel converter,
;  link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
;  [communications manager (not used here)] on the B012, and link 3
;  connected to the IMS B012 PipeTail (see [1]).  By the way, link 2 from
;  the B004 goes to the ConfigUp slot just under the PipeHead slot
;  (this connects it to the T212).  Finally, the B004's Down link must run
;  to the B012's Up link.
;
;
;  ========  SETTING THE C004 CROSSBAR SWITCHES  ========
;
;  Once you have connected the hardware in the fashion mentioned above,
;  the system is ready to be transformed to a hypercube.  Three codes by
;  Mike Esposito are used here: t2.nif, root.tld, and switch.tld.  I have
;  a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
;  Mike's code passes instructions to the T212 on the B012; which, in turn,
;  tells the C004's how to connect their switches.  After the code has
;  executed, the (very specific) configuration that we are looking for
;  will exist.  Specifically, the following (output from CHECK /R) is what
;  this process gives us:
;
;  check 1.21
;   #  Part      rate Mb  Bt  [  Link0  Link1  Link2  Link3  ]
;   0  T414b-15  0.09     0   [  HOST   1:1    2:1    3:2    ]
;   1  T800c-20  0.80     1   [  4:3    0:1    5:1    6:0    ]
;   2  T2  -17   0.49     1   [  C004   0:2    ...    C004   ]
;   3  T800c-20  0.80     2   [  7:3    8:2    0:3    9:0    ]
;   4  T800c-20  0.76     3   [  9:3    10:2   11:1   1:0    ]
;   5  T800d-20  0.90     1   [  8:3    1:2    10:1   12:0   ]
;   6  T800d-20  0.76     0   [  1:3    12:2   7:1    11:0   ]
;   7  T800d-20  0.76     3   [  13:3   6:2    14:1   3:0    ]
;   8  T800d-20  0.90     2   [  14:3   15:2   3:1    5:0    ]
;   9  T800c-20  0.77     0   [  3:3    13:2   15:1   4:0    ]
;  10  T800d-20  0.90     2   [  16:3   5:2    4:1    15:0   ]
;  11  T800d-20  0.90     1   [  6:3    4:2    16:1   13:0   ]
;  12  T800d-20  0.77     0   [  5:3    16:2   6:1    14:0   ]
;  13  T800d-20  0.77     3   [  11:3   17:2   9:1    7:0    ]
;  14  T800c-20  0.90     1   [  12:3   7:2    17:1   8:0    ]
;  15  T800c-20  0.90     2   [  10:3   9:2    8:1    17:0   ]
;  16  T800c-20  0.76     3   [  17:3   11:2   12:1   10:0   ]
;  17  T800d-20  0.88     2   [  15:3   14:2   13:1   16:0   ]
;
;  Here node C0 is the root transputer (on the IMS B004) and node C2 is
;  the T212 (on the IMS B012).  The other sixteen nodes are the T800's
;  that are used for the work.  A logical interconnection topology is
;  described below.
;
;
;  ========  TOPOLOGY  ========
;
;  The physical interconnection scheme described above is an actual 4-cube
;  with one exception.  The root node (C0) is situated BETWEEN nodes C1
;  and C3 (which would be connected directly in the usual 4-cube).  This
;  gives us two 3-cubes: one whose node labeling is G0xxx and the other,
;  whose node labeling is G1xxx (where the xxx represents all permutations
;  of 3 bits).  These are the usual 3-cubes, and they will exist if we
;  define the node numbering/labeling correctly.
;
;
;  ========  STRATEGY  ========
;
;  The node labeling established by the NIF is available via the variable
;  _node_number (see <conc.h>) in source code.  Therefore, we would like a
;  smart labeling scheme in the NIF file so that programming is easier.
;  This, of course, is subject to the restriction that NIF labels begin
;  with N1 and so on.
;
;  One such method would be to define a NIF labeling so that the Gray code
;  label for a node would be (_node_number - 2).  In fact, this is
;  possible and the adjacencies defined below allow us to realize this
;  feature.  Below, node N0 is the host PC, node N1 is the root transputer
;  (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
;  nodes of a 4-cube), and N18 is not used (but it's the T212).


host_server cio.exe;                                            (default)


;  NODE  TRANSPUTER    RESET
;   ID   LOADABLE      COMES     DESCRIPTION OF LINK CONNECTIONS
;        CODE (.tld)   FROM:     LINK0   LINK1   LINK2   LINK3

     1,  gfpphost,     r0,       0,      2,      18,     10;      B004
     2,  gfppnode,     r1,       4,      1,      3,      6;       B012
     3,  gfppnode,     r2,       11,     2,      5,      7;
     4,  gfppnode,     r5,       12,     5,      8,      2;
     5,  gfppnode,     r3,       9,      3,      4,      13;
     6,  gfppnode,     r7,       2,      7,      14,     8;
     7,  gfppnode,     r9,       3,      9,      6,      15;
     8,  gfppnode,     r4,       6,      4,      9,      16;
     9,  gfppnode,     r8,       17,     8,      7,      5;
    10,  gfppnode,     r11,      14,     11,     1,      12;
    11,  gfppnode,     r13,      15,     13,     10,     3;
    12,  gfppnode,     r16,      10,     16,     13,     4;
    13,  gfppnode,     r12,      5,      12,     11,     17;
    14,  gfppnode,     r6,       16,     6,      15,     10;
    15,  gfppnode,     r14,      7,      14,     17,     11;
    16,  gfppnode,     r17,      8,      17,     12,     14;
    17,  gfppnode,     r15,      13,     15,     16,     9;
    18,  switch,       s1,       ,       1,      ,       ;        T212


;  EOF gfpp.nif
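The STRATEGY comments in gfpp.nif say the NIF labels are chosen so that a worker's Gray code label is (_node_number - 2). In a hypercube, two nodes are neighbors exactly when their labels differ in one bit, so a node program can compute its neighbor along any dimension with an exclusive-or. The following is a small sketch of that arithmetic under those stated conventions; the function names are hypothetical, and the one physical exception described in TOPOLOGY (the root sitting between two nodes that would otherwise be linked directly) is deliberately not modeled.

```c
/* Hypercube label of a worker, given its NIF node number (N2..N17). */
int gray_label_sketch(int node_number)
{
    return node_number - 2;
}

/* Label of the neighbor reached by crossing dimension "bit". */
int cube_neighbor_sketch(int label, int bit)
{
    return label ^ (1 << bit);
}

/* NIF node number of the neighbor of "node_number" along "bit". */
int nif_neighbor_sketch(int node_number, int bit)
{
    return cube_neighbor_sketch(gray_label_sketch(node_number), bit) + 2;
}
```

For example, node N3 has label 1, so its neighbors across dimensions 0 through 3 carry labels 0, 3, 5, and 9, which are nodes N2, N5, N7, and N11.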
gf.h


/* ============== PROGRAM INFORMATION ==============
 *
 *  SOURCE   :  gf.h
 *  VERSION  :  2.5
 *  DATE     :  21 September 1991
 *  AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *  SEE ALSO :  gfpc.mak      makefile for the complete pivoting case
 *              gfpp.mak      makefile for the partial pivoting case
 *              gfpchost.c    host code for the complete pivoting case
 *              gfpphost.c    host code for the partial pivoting case
 *              gfpcnode.c    node code for the complete pivoting case
 *              gfppnode.c    node code for the partial pivoting case
 *
 *
 * ================== REFERENCES ==================
 *
 *  [1] Gragg, William B.  MATLAB code and personal conversations, 1991.
 *
 *
 * ================== DESCRIPTION =================
 *
 *  This header file is shared by several programs (listed above).  Each of
 *  these codes has something to do with a parallel implementation of Gauss
 *  Factorization (GF).  Several pivoting strategies are supported.  Files
 *  like gfpc*.c represent a COMPLETE pivoting strategy, and the files like
 *  gfpp*.c give the corresponding code for the PARTIAL pivoting scheme.
 *
 *  The basic algorithm is from [1].  Parallelism is sought by distributing
 *  the columns of A across the nodes of a multiprocessor system (using the
 *  hypercube interconnection topology).  The program is designed for the
 *  Intel iPSC/2 or a network of Inmos transputers.
 *
 *  The algorithm factors Q'AP = LU with P and Q permutation matrices, L
 *  unit lower trapezoidal (r columns) and U upper trapezoidal with nonzero
 *  diagonal elements (r rows).  The program is designed for a general
 *  matrix, A.  It does not assume A square or sparse.  There is no effort
 *  to optimize for this, or any other, special structure.  There is one
 *  caveat: I designed the code to gather data for square matrices of full
 *  rank.  Therefore, I have tested the square case of random matrices very
 *  carefully.  While the code should work for any general matrix, it has
 *  not been carefully tested in other cases.  Additionally, since I sought
 *  timing data for matrices of full rank, I have NOT addressed the problem
 *  of gathering columns (back to the host) to the right of the final pivot
 *  for rank-deficient matrices.  This would not be a difficult task, but I
 *  did not make this effort since it has no bearing on my goal.
 *
 *  In the partial pivoting code, the search for pivots is carried out only
 *  in the pivot column, so P is the identity (i.e., there are no column
 *  interchanges).  Many of the remaining comments pertain to the complete
 *  pivoting case, since it is the most challenging.  The changes for the
 *  partial pivoting case should be evident in most cases.  At times, when
 *  the changes are not necessarily evident, clarifying remarks address the
 *  partial pivoting scheme.  This header file contains the majority of the
 *  background and algorithm information, but if you're after a careful
 *  study of the differences, compare the source codes.  The algorithm below
 *  gives a road map through the code.
 *
 */


/* =========== ALGORITHM: BACKGROUND ===========
 *
 *  1.) Preliminaries.  Consider A (m x n), a matrix of real numbers.  The
 *  permutation vectors, p and q, characterize column and row permutations
 *  (respectively).  The scalar, (g/a), is the growth factor.  The integer,
 *  r, is a fairly reasonable determination of the 'numerical rank' of A.
 *  The C language convention is followed, numbering rows and columns from
 *  zero; and storing dynamic, two-dimensional arrays (matrices) in
 *  row-major order.  The 'pivot' will be that element located at A(k,k).
 *  The area (in A) below and to the right of the pivot [all A(i,j) where
 *  i > k and j > k ] is called the 'Gauss transform area'.
 *
 *  2.) Communications and Coordination.  Let N be the number of processors
 *  (workers) in the hypercube.  These nodes are labeled with a Gray code
 *  { 0 .. (N - 1) }.  The root (host) node distributes the columns of A to
 *  the nodes.  This is done cyclically, using the C modulus operator (%).
 *  That is, column j will be sent to processor (j mod N).  Once the nodes
 *  have their columns, they begin work.  Communication (for the complete
 *  pivoting case) involves an election process for the next pivot, where
 *  each of the nodes finds its best candidate and then the election finds
 *  the best candidate in the global picture.  This is done in lg(N) steps
 *  using the cubecast_from() function.
 *
 *  The partial pivoting case does not require the election process that
 *  complete pivoting needs, but both methods look similar (in terms of
 *  communication) after the elections are complete.  The node holding the
 *  pivot column must perform the pivot column arithmetic and distribute
 *  the resulting pivot column (also in lg(N) steps) to the other nodes.
 *  Communications functions are not explained much in this code, but
 *  details can be found in the files comm.h & comm.c.
 *
 *  3.) Pivoting Strategy.  The complete pivoting strategy's election
 *  process (at each stage) determines the element in (the entire Gauss
 *  transform area of) A that is largest in absolute value.  This element
 *  wins the election and is 'moved' to A(k,k) for the upcoming stage.  It
 *  isn't really moved...but p and q are updated so that we can keep track
 *  of permutations.  During the search for the new pivot, candidates are
 *  denoted A(s,t) = u.  The largest of the candidates is installed as the
 *  next pivot.  There seems to be too much overhead associated with this
 *  fancy indexing off of p[] and q[].  For the partial pivoting code, I
 *  chose to ACTUALLY SWAP rows (if necessary) at each stage.  This makes
 *  the 'pp' code a bit easier to read.
 *
 *  4.) Stopping.  The GF process is repeated until one of two criteria is
 *  satisfied.  First, of course, we may run out of matrix.  Secondly, we
 *  may find a pivot whose absolute value is less than our tolerance (tol).
 *  In the latter case, we have a rank-deficient A.  Currently, the codes
 *  recognize rank-deficiency and bail out of the iteration loop; but they
 *  do not gather (to the host) all of the remaining columns to the right
 *  of the last pivot.  This is discussed above.
 *
 *
 * ======== ALGORITHM: THE GF PROCESS ========
 *
 *  0.) Initialization.  Let dim be the dimension of the hypercube.  Let
 *  k = 0.  Search A and find the largest (in absolute value) element, u.
 *  This is done at each node.  Once each node has a local candidate for
 *  the next pivot, an election is held, dimension-by-dimension.  This
 *  requires (dim) steps, and when it is finished, every processor knows
 *  exactly the position and value of the next pivot.  Exception: In the
 *  partial pivoting code, the processor which has the pivot column simply
 *  searches the (proper part of the) pivot column for the next pivot and
 *  then informs the other processors.
 *
 *  1.) Status.  Every node knows the position and value of the next pivot,
 *  namely u = A(s,t); and where it should be installed, A(k,k).  The growth
 *  rate is adjusted: g = max[g, abs(u)].  If (abs(u) < tol), then A is
 *  rank-deficient and we exit the loop (using the C 'break' statement).
 *
 *  2.) Permutations.  We account for the interchange of rows s and k and
 *  columns t and k by swapping the elements of p[] that are indexed by k
 *  and t and swapping the elements in q[] indexed by k and s.  This
 *  (effectively) establishes the new pivot at A(k,k).  The column
 *  permutation vector has no significance in the partial pivoting case
 *  since it would never be changed.  The matrix, P, in this case, is
 *  simply the identity.
 *
 *  3.) Adjust the Gauss Transform Area.
 *
 *      (a) In the (single) node that holds the new pivot's column (k),
 *      divide every element below the pivot by the pivot value.  Broadcast
 *      this column to every other node.  Node 0 updates the manager, who
 *      uses this information to append to his copy of the resulting
 *      (factored) A.
 *
 *      (b) Now every worker has the updated column k.  At every node, do
 *      the following: For every element A(i,j) [ where i > k and j > k ]
 *      let A(i,j) = A(i,j) - (A(i,k) * A(k,j)).


 *
 *  4.) Pivot Search.  In the Gauss transform area, G, search for the
 *  element that is largest in absolute value.  Its position is A(s,t) and
 *  its value is u.  The candidates are chosen at the local (processor)
 *  level, then an election is held at the global level to determine the
 *  best candidate in the same manner that was described in step 0.
 *  Increment k.  Repeat the process (go back to step 1).  The obvious
 *  exceptions apply to the partial pivoting case.
 *
 *
 * ========== NOTES FOR IMPROVEMENT ==========
 *
 *  Currently the code does not give full support for rank-deficiency.  It
 *  DOES break out of the loop, but everything to the right of the final
 *  pivot column will be garbage.  It would be relatively easy to add the
 *  necessary post-iteration rank-deficiency check and coalesce each of the
 *  remaining columns back to the manager, but this code was created to
 *  test the full-rank cases and take performance data.
 *
 *  Secondly, there is the issue of whether it is better for the manager to
 *  receive each pivot column as it becomes available, or if all columns
 *  should be sent in at the end.  I'm not yet sure which method is better,
 *  but the current code keeps the root node up-to-date at each stage.  This
 *  is probably the best solution to the problem above and would probably
 *  enhance performance during the iterations.  It REALLY SHOULD BE TESTED!
 *
 *  There are many other questions that pertain to optimization that remain
 *  unanswered (especially in the complete pivoting case).
 *
 */
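The stages described in the algorithm comments (pivot search, interchange, scaling of the pivot column, update of the Gauss transform area, and the tolerance test) can be illustrated on a single processor. The sketch below is not the thesis code: it is serial, it physically swaps rows and columns instead of indexing through p[] and q[], and the name gf_complete_sketch and the flat row-major array layout are assumptions made for the example. It records the permutations so that q[k] and p[k] name the original row and column now sitting at position k.

```c
#include <math.h>

/* Serial sketch of Gauss factorization with complete pivoting.
 * a[] holds an m-by-n matrix in row-major order and is overwritten with
 * the factors (multipliers below the diagonal, U on and above it).
 * Returns r, the number of stages completed (the numerical rank). */
int gf_complete_sketch(double *a, int m, int n, int *q, int *p, double tol)
{
    int i, j, k, s, t, itmp, r = 0;
    int stages = (m < n) ? m : n;
    double u, dtmp;

    for (i = 0; i < m; i++) q[i] = i;   /* row permutation    */
    for (j = 0; j < n; j++) p[j] = j;   /* column permutation */

    for (k = 0; k < stages; k++) {
        /* Pivot search: largest |A(i,j)| with i >= k and j >= k. */
        s = k; t = k; u = fabs(a[k * n + k]);
        for (i = k; i < m; i++)
            for (j = k; j < n; j++)
                if (fabs(a[i * n + j]) > u) {
                    u = fabs(a[i * n + j]); s = i; t = j;
                }

        if (u < tol)                    /* rank-deficient: stop */
            break;

        if (s != k) {                   /* interchange rows s and k */
            for (j = 0; j < n; j++) {
                dtmp = a[s * n + j]; a[s * n + j] = a[k * n + j];
                a[k * n + j] = dtmp;
            }
            itmp = q[s]; q[s] = q[k]; q[k] = itmp;
        }
        if (t != k) {                   /* interchange columns t and k */
            for (i = 0; i < m; i++) {
                dtmp = a[i * n + t]; a[i * n + t] = a[i * n + k];
                a[i * n + k] = dtmp;
            }
            itmp = p[t]; p[t] = p[k]; p[k] = itmp;
        }

        /* Scale the pivot column, then update the Gauss transform area. */
        for (i = k + 1; i < m; i++) {
            a[i * n + k] /= a[k * n + k];
            for (j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
        r++;
    }
    return r;
}
```

In the parallel codes the inner update is what each worker applies to its own columns after receiving the scaled pivot column; the search is the part that the election replaces.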


- ===========  ALGORITHM:  COICLUSIOI  ========== - 

1. )  Rank.  Set  r,  the  rank  of  A,  equal  to  the  number  of  iterations  that 

were  executed.  This  is  automatic  in  the  manager  (host)  code  since 
the  integer,  r,  is  used  as  the  loop  index.  The  worker  nodes  use  k  for 
a  loop  index  variable. 

2. )  Interchanges.  Row  and  column  interchanges  are  not  actually  done  in 
the  complete  pivoting  code.  Instead,  we  maintain  permutation  vectors, 
pC]  and  qC] .  You  may  note  that  while  both  vectors  are  used  heavily 
during  the  GF  process  q[] ,  in  particular,  comes  in  handy  at  the  end  to 
set  A  in  order.  The  partial  pivoting  code  performs  the  actual  inter¬ 
changes  of  rows.  At  first,  we  would  be  inclined  to  believe  that  the 
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indexing  by  p[]  and  q[]  lands  to  batter  parfoxaanca,  but  there  is  no 
clear  timing  evidence  (at  this  point)  that  supports  this  idea. 


3.)  Factors.  The  upper  trapezoidal  matrix,  U,  is  the  upper  trapezoid 
of  (the  resulting,  factored)  i  (the  diagonal  of  A  and  everything  above 
that).  The  loser  trapezoidal  matrix,  L,  is  formed  by  placing  ones  on 
the  diagonal  of  A;  zeros  above;  and  copying  the  loser  trapezoid  of  A 
(excluding  the  diagonal).  To  form  Q*AP,  se  use  THE  ORIGIIAL  copy  of  A 
(not  the  factored,  resulting  A)  and  the  matrices  Q  and  P  that  are 
implied  by  qD  and  pD  .  That  is,  in  the  end,  se  set  Q[qCi]][i]  =  1.0 
for  all  i  in  {  0,  1,  ...,  (m-l)  >  and  set  P[p[j]][j]  =  1.0  for  all  j 
in  {  0,  1 ,  . . . ,  (n-1)  }. 


MANIFEST CONSTANTS


Section 1: Communications Aids (Message Types and Type Selectors)

The following manifest constants simplify the communications effort.

The TRANSPUTER section is fairly general in nature. The iPSC/2 section
specifies types and type selectors for csend() and crecv(). It IS
SIGNIFICANT that NODE_OFFSET is the largest of these. It must remain
the largest so that (for all nodes n) the value of (n * NODE_OFFSET)
cannot be equal to one of the other message types (consider n == 0).
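The reasoning about NODE_OFFSET can be checked with a small sketch. It assumes, as the listing suggests, that the fixed message types occupy the values 1 through 6 and that NODE_OFFSET is 7; the function names are hypothetical.

```c
#define ARG_TYPE     1   /* smallest fixed message type (from the listing) */
#define NODE_OFFSET  7   /* must stay larger than every fixed message type */

/* Message type that node n uses when it sends to the host. */
long node_message_type(long n)
{
    return n * NODE_OFFSET;
}

/* Returns 1 when t equals one of the fixed message types
 * (ARG_TYPE up to NODE_OFFSET - 1), and 0 otherwise.
 */
int collides_with_fixed_type(long t)
{
    return (t >= ARG_TYPE && t < NODE_OFFSET);
}
```

Node 0 yields type 0, which lies below every fixed type, and every other node yields a multiple of NODE_OFFSET, which lies above them. If NODE_OFFSET were smaller than some fixed type, a low-numbered node could produce a colliding value.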


#ifdef TRANSPUTER

#define CUBESIZE                /* change these for a cube of other dim */
#define DIMENSION

#else  /* iPSC/2 */

#define ARG_TYPE        1   /* for passing command line argument info  */
#define COL_SIZE_TYPE   2   /* for sending n part of size(A) ==> cols  */
#define COL_TYPE        3   /* use this to send a column               */
#define PIVOT_TYPE      4   /* candidate for next pivot                */
#define PCOL_TYPE       5   /* use this to send a pivot column         */
#define ROW_SIZE_TYPE   6   /* for sending m part of size(A) ==> rows  */
#define NODE_OFFSET     7   /* for sending messages from nodes         */

#endif


/* ----------------------------------------------------------------------
 *
 * Section 3: Timing
 *
 * The root uses a two-dimensional array where the rows are indexed by the
 * node numbers and the columns use the following indexing. The nodes, of
 * course, only need a one-dimensional array with indexing according to
 * the following scheme. There are a total of MAX_EVENTS elements in the
 * array, and indexing for a specific event is given by START_TIME, SETUP,
 * and so on. The partial pivoting case does not use all of the events.
 *
 * ----------------------------------------------------------------------
 */

#define MAX_EVENTS      18  /* number of events that we want to time   */

#define DATA_SOURCE      0  /* node number of source of the data       */
#define START_TIME       1  /* t(0) ==> starting time for the node     */
#define SETUP            2  /* from t(0) until starting to receive cols */
#define DISTRIB_COLS     3  /* time to distribute columns              */
#define FIRST_PIVOT      4  /* from receipt of last col to start iter  */

/* The next two only apply to nodes zero and eight */
#define PCOLS_TO_HOST    5  /* time spent passing pivot cols to host   */
#define PIVOTS_TO_HOST   6  /* time spent passing pivots to host       */

/* The next five kind of represent the big picture */
#define PIVOT_ELECTION   7  /* time spent on pivot elections           */
#define UPDATING_PQ      8  /* time spent updating permutations p and q */
#define PCOL_ARITHMETIC  9  /* time spent on pivot column arithmetic   */
#define PCOL_DISTRIB    10  /* time spent distributing pivot columns   */
#define UPDATING_G      11  /* time spent updating the Gauss transform */

/* The next four are times from within update_G() */
#define PRLTIME         12  /* pivot row location time                 */
#define LCTIME          13  /* time to determine if a column is local  */
#define G_ARITHMETIC    14  /* time spent on arithmetic within G       */
#define LOOPTIME        15  /* time for both for() loops in update_G() */

/* The last two are back at the big picture level again */
#define ITERATION       16  /* time checked before and after iteration */
#define STOP            17  /* the last time sampled by the node       */



/* ----------------------------------------------------------------------
 *
 * Section 4: General
 *
 * ----------------------------------------------------------------------
 */

#define AFT     4   /* number of digits to print after decimal   */
#define WIDTH   6   /* number of characters (including decimal)  */


/* ----------------------------------------------------------------------
 *
 * Section 6: A special flag used for the id field of a pivot. When it
 * appears, it indicates that the sending node's part of A has
 * no elements as big as the tolerance, tol; and therefore this node's
 * candidate for pivot should not be considered.
 *
 * ----------------------------------------------------------------------
 */

#define RANK_DEFICIENT  -1


/* ---------- =========== TYPE DEFINITIONS =========== ---------- */

typedef struct {

    int     id;
    double  u;
    int     s,
            t;

} Pivot_Type;


/* ---------- =============== EOF gf.h =============== ---------- */
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gfpphost.c 


/* ---------- ========== PROGRAM INFORMATION ========== ----------
 *
 * SOURCE  : gfpphost.c
 * VERSION : 2.0
 * DATE    : 21 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ---------- ============== DESCRIPTION ============== ----------
 *
 * Gauss Factorization (GF) with Partial Pivoting: Parallel Version.
 * This is the manager portion of the code. See [gf.h] for details.
 *
 */

#include <stdio.h>
#include <string.h>

#ifdef TRANSPUTER

#include <conc.h>
#include <stdlib.h>     /* addfree(), _heapend */

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <clargs.h>
#include <comm.h>
#include <epsilon.h>
#include <generate.h>
#include <io.h>
#include <ops.h>
#include <timing.h>

#else  /* iPSC/2 */

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/clargs.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/epsilon.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/io.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"
#endif

#include "gf.h"


/* ---------- =========== MANIFEST CONSTANTS =========== ----------
 *
 * The following manifest constants are used to determine the size of the
 * option list, optv[]; indexing associated with valid command line
 * arguments; and selection constants for the user's choice of matrix type
 * [used in generate()].
 *
 */

#define NUMBER_OF_ARGS    3   /* -d -t -v                */

#define DIM               0   /* index into optv[]       */
#define TIMING            1   /*   "     "     "         */
#define VERBOSE           2   /*   "     "     "         */

#define SELECT_QUIT       0   /* menu / matrix selection */
#define SELECT_IDENTITY   1
#define SELECT_HILBERT    2
#define SELECT_RANDOM     3
#define SELECT_WILKINSON  4


/* ---------- ================= GLOBALS ================= ---------- */

static char version[] = "Parallel GF with Partial Pivoting, Version 2.0";

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#else  /* iPSC/2 */

static char *cubename;

static char *nodecode = "gfppnode";

#endif /* TRANSPUTER */

static Arg_Struct *optv[NUMBER_OF_ARGS];


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * The structure is defined more carefully in clargs.h, but the basic idea
 * is that we have an array of pointers to type Arg_Struct...in this case,
 * there are NUMBER_OF_ARGS valid arguments and the next few steps take
 * care of allocation and definition of them. The -d argument allows the
 * user to enter the desired dimension of the hypercube, -t sets timing on
 * and -v is used to set verbose on.
 */

void define_valid_args() {

    static int interpret[] = { LONG };

    install_complex_arg(DIM, optv, "-d", interpret, 1);

    install_simple_arg(TIMING, optv, "-t");
    install_simple_arg(VERBOSE, optv, "-v");

}
/* End define_valid_args() ---------------------------------------- */


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * A simple function to display the results....
 */

#ifdef PROTOTYPE

void display_timing_data(Double_Matrix_Type *A,
                         int                dim,
                         double             a,
                         double             eps,
                         double             g,
                         double             tol,
                         int                r,
                         double             **t)

#else

void display_timing_data(A, dim, a, eps, g, tol, r, t)

Double_Matrix_Type *A;
int                dim;
double             a,
                   eps,
                   g,
                   tol;
int                r;
double             **t;

#endif
{
    int aft,
        cubesize = pow2(dim),
        i,
        m = A->rows,
        n = A->cols,
        width;

#ifdef TRANSPUTER  /* is measured in 64 microsecond ticks ==> 4-5 places */

    aft   = 6;
    width = 15;

#else  /* iPSC/2 is measured in milliseconds ==> three places */

    aft   = 3;
    width = 13;

#endif

    printf("---------- ========= TIMING DATA ========= ----------");
    printf("----------\n\n");

    printf(" Hypercube of order %d ", dim);
    (dim == 0) ? (printf("(1 processor)\n\n")) :
                 (printf("(%d processors)\n\n", cubesize));

    printf("Problem size ==> size(A) = (%d x %d).\n", m, n);
    printf("Machine precision: eps = %e\n", eps);
    printf("Tolerance: tol = %e\n", tol);
    printf("Growth factor: g/a = %e\n", (g/a));
    printf("Rank: rank(A) =%3d\n", r);
    printf("Units for timing data: = seconds\n");

    for (i = 0; i < cubesize; i++) {

        printf("\nNode %2d Data ----------", i);
        printf("----------\n\n");

        printf("Setup and initialization: ");
        printf("%*.*lf", width, aft, t[i][SETUP]);
        printf("\nInitial column distribution: ");
        printf("%*.*lf", width, aft, t[i][DISTRIB_COLS]);

        if (i == 0) {

            printf("\nTransmission of pivot columns to the host: ");
            printf("%*.*lf", width, aft, t[i][PCOLS_TO_HOST]);
            printf("\nTransmission of pivots to the host: ");
            printf("%*.*lf", width, aft, t[i][PIVOTS_TO_HOST]);
        }

        printf("\nPerformance of pivot column arithmetic: ");
        printf("%*.*lf", width, aft, t[i][PCOL_ARITHMETIC]);
        printf("\nDistribution of pivot columns: ");
        printf("%*.*lf", width, aft, t[i][PCOL_DISTRIB]);
        printf("\nPerformance of updates and arithmetic in G: ");
        printf("%*.*lf", width, aft, t[i][UPDATING_G]);
        printf("\nUpdate_G(): loop time including arithmetic: ");
        printf("%*.*lf", width, aft, t[i][LOOPTIME]);

        printf("\n\nTime for all work inside main iteration loop: ");
        printf("%*.*lf", width, aft, t[i][ITERATION]);
        printf("\nTotal time from start to stop: ");
        printf("%*.*lf\n\n", width, aft, (t[i][STOP] - t[i][START_TIME]));
    }

}
/* End display_timing_data() -------------------------------------- */

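The timing report relies on the printf() conversion "%*.*lf", where the field width and the number of digits after the decimal point are taken from the argument list. The sketch below isolates that idea; format_time() is a hypothetical helper, and it uses "%*.*f", which is equivalent for a double argument in standard C.

```c
#include <stdio.h>

/* Format a time (in seconds) right-justified in 'width' columns with 'aft'
 * digits after the decimal point. Returns the number of characters that
 * the formatted value occupies (as reported by snprintf).
 */
int format_time(char *buf, size_t len, int width, int aft, double seconds)
{
    return snprintf(buf, len, "%*.*f", width, aft, seconds);
}
```

With width = 13 and aft = 3 (the iPSC/2 settings in the listing), a value of 1.5 seconds would be rendered as "1.500" padded on the left to thirteen characters.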


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * This function distributes the columns of A to the nodes of the
 * hypercube. The loop variable, j, designates each column of A in turn.
 * The column buffer, cbuf[], copies from A the column to be transmitted.
 * After cbuf[] is filled, [i = (j mod cubesize)] means that node i will
 * get column j and the modulus operation seems to be a reasonable and
 * efficient scheme of distribution. Finally, the call to send() ships
 * the column out to the appropriate node.
 *
 * ----------------------------------------------------------------------
 */

#ifdef PROTOTYPE

void distribute_columns(Double_Matrix_Type *A, int dim, double *cbuf)

#else

void distribute_columns(A, dim, cbuf)

Double_Matrix_Type *A;
int                dim;
double             *cbuf;

#endif
{

    int  i,
         j,
         pos = 42,                  /* position of print head          */
         rm  = LINE_LENGTH - 10;    /* right margin (see matrix.h)     */

    long cubesize   = pow2(dim),
         sizeof_col = (long) (A->rows * sizeof(double));


    printf("Distributing the columns of A to the nodes");

    for (j = 0; j < A->cols; j++) {

        for (i = 0; i < A->rows; i++) { cbuf[i] = A->matrix[i][j]; }


        i = j % cubesize;           /* column --> node i */

#ifdef TRANSPUTER                   /* node 0 has to sort 'em out */

        if (i < 8) {
            send(0, (char *) cbuf, sizeof_col, cubesize);
        }
        else {
            send(8, (char *) cbuf, sizeof_col, cubesize);
        }

#else  /* iPSC/2 */

        send(i, (char *) cbuf, sizeof_col, COL_TYPE);

#endif /* TRANSPUTER */

        printf(".");

        if (pos++ > rm) {

            pos = 0;
            printf("\n");
        }

    }

    printf("\nColumn distribution complete.\n\n");

}
/* End distribute_columns() --------------------------------------- */

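The wrap (modulus) mapping used by distribute_columns() can be examined on its own. The following is a minimal sketch with hypothetical helper names; it only models which node owns a column and how many columns each node ends up with, not the message passing itself.

```c
/* Node that receives column j under the mapping i = (j mod cubesize). */
long column_owner(long j, long cubesize)
{
    return j % cubesize;
}

/* Number of columns, out of n in total, that land on the given node.
 * Every node gets n / cubesize columns; the first (n mod cubesize) nodes
 * get one extra.
 */
long columns_on_node(long n, long cubesize, long node)
{
    long count = n / cubesize;

    if (node < n % cubesize) { count++; }
    return count;
}
```

For example, with 10 columns on 8 nodes, nodes 0 and 1 hold two columns each and the remaining nodes hold one, so the load differs by at most one column between any two nodes.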

/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * This function prompts the user for matrix size and type, then generates
 * the matrix with a call to a function from generate.c.
 */


#ifdef PROTOTYPE

Double_Matrix_Type *generate(int *m, int *n)

#else

Double_Matrix_Type *generate(m, n)

int *m,
    *n;
#endif
{
    Double_Matrix_Type *A;

    int matrix_type,
        valid = FALSE;


    printf("Please enter the number of rows in A: ");
    scanf("%d", m);
    fflush(stdin);

    printf("\n...and the number of columns in A: ");
    scanf("%d", n);
    fflush(stdin);

    printf("\n\nSelect from the following list of matrices:");

    while (!valid) {

        printf("\n\n");
        printf("    %d.)  QUIT      \n", SELECT_QUIT);
        printf("    %d.)  Identity  \n", SELECT_IDENTITY);
        printf("    %d.)  Hilbert   \n", SELECT_HILBERT);
        printf("    %d.)  Random    \n", SELECT_RANDOM);
        printf("    %d.)  Wilkinson \n", SELECT_WILKINSON);
        printf("\n");

        scanf("%d", &matrix_type);
        fflush(stdin);

        switch(matrix_type) {

            case SELECT_IDENTITY  :
            case SELECT_HILBERT   :
            case SELECT_RANDOM    :
            case SELECT_WILKINSON : valid = TRUE; break;

            case SELECT_QUIT      : exit(EXIT_SUCCESS);
        }

    }   /* end while() */


    switch(matrix_type) {

        case SELECT_IDENTITY:

            printf("\n\nGenerating A = identity(%d, %d).\n\n", *m, *n);

            A = identity(*m, *n);
            break;

        case SELECT_HILBERT:

            printf("\n\nGenerating A = hilbert(%d, %d).\n\n", *m, *n);

            A = hilbert(*m, *n);
            break;

        case SELECT_RANDOM:

            printf("\n\nGenerating A = mxrand(%d, %d).\n\n", *m, *n);

            A = mxrand(*m, *n);
            break;

        case SELECT_WILKINSON:

            printf("\n\nGenerating A = wilkinson(%d, %d).\n\n", *m, *n);

            A = wilkinson(*m, *n);
            break;
    }


    if (!A) {

        printf("generate(): Allocation failure for the matrix A.\n");
        exit(EXIT_FAILURE);
    }

    return(A);

}
/* End generate() ------------------------------------------------- */


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * Collect timing data from the nodes. The Intel side of this function
 * takes advantage of the host's ability to receive from any node. The
 * transputer side must receive every node's information from nodes zero &
 * eight (eight only becomes involved in the case of the hybrid 4-cube).
 */

#ifdef PROTOTYPE

double **receive_timing_data(int cubesize)

#else

double **receive_timing_data(cubesize)

int cubesize;

#endif
{

    double **dt;        /* (double) version of t[][]    */

    int    i,
           j;

    long   tlen;        /* length of one node's data    */

    ticks  **t;         /* raw timing data from nodes   */


    /*
     * Perform allocation for the timing data t[][]. The two-dimensional
     * array is indexed by node number for the rows and by event for the
     * columns. For instance, t[i][j] means the time required for event
     * j at node i. Actually, there is an extra row reserved at the end
     * of t[][] for totals: t[cubesize][j] gives the total time for event
     * j across all nodes.
     */

    if (!(dt = (double **) malloc((cubesize+1) * sizeof(double*)))) {

        printf("receive_timing_data(): Allocation failure for dt[][].\n");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < (cubesize + 1); i++) {

        if (!(dt[i] = (double *) calloc(MAX_EVENTS, sizeof(double)))) {

            printf("Host: Allocation failure for dt[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    if (!(t = (ticks **) malloc((cubesize+1) * sizeof(ticks*)))) {

        printf("receive_timing_data(): Allocation failure for t[][].\n");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < (cubesize + 1); i++) {

        if (!(t[i] = (ticks *) calloc(MAX_EVENTS, sizeof(ticks)))) {

            printf("Host: Allocation failure for t[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Receiving timing data from the nodes");

    tlen = (long) (MAX_EVENTS * sizeof(ticks));

    for (i = 0; i < cubesize; i++) {

        printf(".");

#ifdef TRANSPUTER

        if (i < 8) receive(0, (char *) t[i], tlen, cubesize);
        else       receive(8, (char *) t[i], tlen, cubesize);

#else  /* iPSC/2 */

        receive(i, (char *) t[i], tlen, (i * NODE_OFFSET));

#endif /* TRANSPUTER */
    }

    printf("\n\n");


    /* Calculate totals, averages; place totals in t[cubesize] first....
     * then copy to dt[][] and record averages in dt[cubesize].
     */

    for (i = 0; i < cubesize; i++) {

        for (j = 0; j < MAX_EVENTS; j++) t[cubesize][j] += t[i][j];
    }

    /* Fill dt[][] with double values (in seconds). The conversion
     * factors are borrowed from timing.h.
     */

    for (i = 0; i <= cubesize; i++) {

        dt[i][DATA_SOURCE] = (double) t[i][DATA_SOURCE];

        for (j = START_TIME; j < MAX_EVENTS; j++) {

#ifdef TRANSPUTER

            dt[i][j] = ((double) t[i][j]) * LO_PERIOD;

#else

            dt[i][j] = ((double) t[i][j]) * M_PERIOD;

#endif
        }
    }

    /* Convert totals to averages in dt[cubesize] */

    for (j = START_TIME; j < MAX_EVENTS; j++) {

        dt[cubesize][j] /= ((double) cubesize);
    }


    for (i = 0; i < (cubesize + 1); i++) free(t[i]);

    free(t);

    return(dt);

}
/* End receive_timing_data() -------------------------------------- */
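The totals-then-averages step in receive_timing_data() reduces to a short computation per event. The sketch below is an assumption-laden illustration: raw tick counts are modeled as long in a flat row-major array (the listing uses the library type ticks and a pointer-to-pointer layout), and average_event_seconds() is a hypothetical name.

```c
/* Average, in seconds, of one event over all nodes.
 *
 *   t      : row-major nodes-by-events array of raw tick counts
 *   event  : column index of the event of interest
 *   period : seconds per tick (the role played by LO_PERIOD or M_PERIOD)
 */
double average_event_seconds(const long t[], int nodes, int events,
                             int event, double period)
{
    long total = 0;
    int  i;

    /* Sum the raw ticks first, as the listing does in t[cubesize][j] */
    for (i = 0; i < nodes; i++) { total += t[i * events + event]; }

    /* Convert the total to seconds, then divide by the node count */
    return ((double) total * period) / (double) nodes;
}
```

Summing in integer ticks before converting keeps the accumulation exact; only the final conversion and division are done in floating point.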


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * This function analyzes the command line that the user supplied and sets
 * variables accordingly. The valid arguments are given by
 * define_valid_args(), and the real work is passed off to
 * interpret_args(), from the clargs library.
 */

#ifdef PROTOTYPE

void resolve_args(int argc, char *argv[],
                  int *dim, int *timing, int *verbose)

#else

void resolve_args(argc, argv, dim, timing, verbose)

int  argc;
char *argv[];
int  *dim,
     *timing,
     *verbose;

#endif
{
    int maxdim = 3,
        valid  = FALSE;


    interpret_args(argc, argv, NUMBER_OF_ARGS, optv);   /* see clargs.h */

#ifdef TRANSPUTER

    *dim = DIMENSION;

#else  /* iPSC/2 */

    if (optv[DIM]->found) *dim = (int) optv[DIM]->lsa[0];

    switch (*dim) {

        case 0: case 1: case 2: case 3: break;

        default: while (!valid) {

                     printf("Enter desired cube dimension (0...%d): ",
                            maxdim);
                     scanf("%d", dim);
                     fflush(stdin);

                     switch(*dim) {

                         case 0: case 1: case 2: case 3:
                             valid = TRUE;
                             break;
                     }
                 }
    }   /* end switch() */

#endif /* TRANSPUTER */

    (optv[TIMING]->found) ? (*timing = TRUE) : (*timing = FALSE);

    (optv[VERBOSE]->found) ? (*verbose = TRUE) : (*verbose = FALSE);

    printf("Argument resolution complete...\n\n");
    printf("    Cube Dimension: %d\n", *dim);

    if (*timing) printf("    Timing: ON\n");

    (*verbose) ? (printf("    Verbose Mode: ON\n\n")) :
                 (printf("\n"));

}
/* End resolve_args() --------------------------------------------- */
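The dimension check in resolve_args() and the cube size it implies can be sketched as two small functions. Both names are hypothetical; cube_size() merely stands in for the library routine pow2() used in the listings, whose actual definition is not shown here.

```c
/* Number of nodes in a hypercube of the given dimension (2^dim).
 * Hypothetical stand-in for the library's pow2().
 */
long cube_size(int dim)
{
    return 1L << dim;
}

/* Returns 1 when dim lies in 0..maxdim, and 0 otherwise. */
int valid_dimension(int dim, int maxdim)
{
    return (dim >= 0 && dim <= maxdim);
}
```

A range test of this kind accepts the same values as the case 0 through case 3 labels when maxdim is 3, and it would follow maxdim automatically if a larger cube became available.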


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 */

#ifdef PROTOTYPE

void show_resulting_matrices(Double_Matrix_Type *A,
                             Double_Matrix_Type *A0, int *q)

#else

void show_resulting_matrices(A, A0, q)

Double_Matrix_Type *A,
                   *A0;
int                *q;

#endif
{
    Double_Matrix_Type *D,
                       *L,
                       *LU,
                       *P,
                       *QT,
                       *QTA,
                       *QTAP,
                       *U;

    int i,
        j,
        m = A->rows,
        n = A->cols;


    printf("Gauss Factorization Complete...\n\n");
    strcpy(A->name, "A (after GF operations)");


    /* Allocate and form Q' and P */

    if (!(QT = matalloc(m,m))) {

        printf("Allocation failure for QT.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QT->name, "Q Transpose");

    for (i = 0; i < m; i++) { QT->matrix[i][q[i]] = 1.0; }


    if (!(P = identity(n,n))) {

        printf("Allocation failure for P.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(P->name, "P [ Partial (column) Pivoting ==> P == Identity ]");


    /* Here, we slowly form Q'AP, keeping in mind that the A we are
     * talking about is the original A....and we have labeled that one
     * A0. Therefore, we first form QTA (Q'A) as Q' * A0. After we
     * have QTA, we can multiply it (on the right) by P to get Q'AP,
     * or QTAP as it is called here.
     */

    if (!(QTA = matalloc(m,n))) {

        printf("Allocation failure for QTA.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QTA->name, "Q' * (original) A");

    if (matrix_product(QT, A0, QTA) == FAILURE) {

        printf("matrix_product(QTA) Failure.\n");
        exit(EXIT_FAILURE);
    }


    if (!(QTAP = matalloc(m,n))) {

        printf("Allocation failure for QTAP.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QTAP->name, "Q' * A * P");

    if (matrix_product(QTA, P, QTAP) == FAILURE) {

        printf("matrix_product(QTAP) Failure.\n");
        exit(EXIT_FAILURE);
    }


    /* Next, we form L and U so that we can compare Q'AP ?=? LU. */

    L = zeros(m, n);  L->name = "L ";
    U = zeros(m, n);  U->name = "U ";

    for (i = 0; i < A->rows; i++) {

        for (j = 0; j < A->cols; j++) {

            if (i < j) { U->matrix[i][j] = A->matrix[i][j]; }

            if (i == j) {

                L->matrix[i][j] = 1.0;
                U->matrix[i][j] = A->matrix[i][j];
            }

            if (i > j) { L->matrix[i][j] = A->matrix[i][j]; }
        }
    }

    if (!(LU = matalloc(m,n))) {

        printf("Allocation failure for LU.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(LU->name, "L * U");

    if (matrix_product(L, U, LU) == FAILURE) {

        printf("matrix_product(LU) Failure.\n");
        exit(EXIT_FAILURE);
    }


    /* Finally, we create a matrix of differences between the elements
     * found in QTAP (Q'AP) and LU. If everything proceeded according
     * to the plan, this will be a matrix of zeros.
     */

    if (!(D = matalloc(m,n))) {

        printf("Allocation failure for D.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(D->name, "Q'AP - LU");

    for (i = 0; i < m; i++) {

        for (j = 0; j < n; j++) {

            D->matrix[i][j] = (QTAP->matrix[i][j] - LU->matrix[i][j]);
        }
    }

    printmd(*A, WIDTH, AFT);
    printf("\n\n");
    printmd(*L, WIDTH, AFT);
    printf("\n\n");
    printmd(*U, WIDTH, AFT);
    printf("\n\n");

    printmd(*QT, WIDTH, AFT);
    printf("\n\n");
    printmd(*P, WIDTH, AFT);
    printf("\n\n");
    printmd(*QTA, WIDTH, AFT);
    printf("\n\n");
    printmd(*QTAP, WIDTH, AFT);
    printf("\n\n");
    printmd(*LU, WIDTH, AFT);
    printf("\n\n");
    printmd(*D, WIDTH, AFT);
    printf("\n\n");

}
/* End show_resulting_matrices() ---------------------------------- */

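The check Q'AP ?=? LU can also be carried out without forming Q' explicitly, because row i of Q'A0 is simply row q[i] of the original matrix when P is the identity. The sketch below is a simplified illustration for square, flat row-major matrices; max_residual() is a hypothetical name and not part of the original code.

```c
/* Largest |(Q'A0)[i][j] - (LU)[i][j]| for n-by-n row-major matrices, where
 * row i of Q'A0 is row q[i] of the original matrix A0 (P is the identity,
 * as in the partial pivoting case).
 */
double max_residual(const double A0[], const double L[], const double U[],
                    const int q[], int n)
{
    double worst = 0.0;
    int    i, j, k;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            double lu = 0.0, diff;

            /* (LU)[i][j] as an inner product of row i of L, col j of U */
            for (k = 0; k < n; k++) { lu += L[i * n + k] * U[k * n + j]; }

            diff = A0[q[i] * n + j] - lu;
            if (diff < 0.0)   { diff = -diff; }
            if (diff > worst) { worst = diff; }
        }
    }
    return worst;
}
```

A result near the machine precision suggests a consistent factorization; a large value usually points to a permutation vector that does not match the factors.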

/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * This is a simple function to physically swap the elements from row s to
 * the current pivot row, r. It does not concern itself with column r or
 * any column j > r.
 */

#ifdef PROTOTYPE

void swap_rows_left_of_pivot(Double_Matrix_Type *A, int r, int s)

#else

void swap_rows_left_of_pivot(A, r, s)

Double_Matrix_Type *A;
int                r,
                   s;

#endif
{
    double tmp;

    int    j;


    for (j = 0; j < r; j++) {

        tmp             = A->matrix[r][j];
        A->matrix[r][j] = A->matrix[s][j];
        A->matrix[s][j] = tmp;
    }

}
/* End swap_rows_left_of_pivot() ---------------------------------- */


/* ---------- =========== FUNCTION DEFINITION =========== ----------
 *
 * This function performs updates to a permutation vector, v[], of length
 * 'size'. The pivot_index indicates the row or column where the next
 * pivot has been located; and k indicates the stage, or the row and
 * column where the pivot is to be installed.
 */

#ifdef PROTOTYPE

void update_permutation(int v[], int size, int k, int pivot_index)

#else

void update_permutation(v, size, k, pivot_index)

int v[],
    size,
    k,
    pivot_index;

#endif
{
    int i;


    i = v[k];  v[k] = v[pivot_index];  v[pivot_index] = i;

}
/* End update_permutation() --------------------------------------- */
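A short trace shows how repeated exchanges build up a permutation vector. This is a minimal sketch under the assumption that the vector starts as the identity; the helper names are hypothetical and the 'size' argument of the original is omitted because the exchange itself does not need it.

```c
/* Start from the identity permutation: v[i] = i. */
void identity_permutation(int v[], int size)
{
    int i;

    for (i = 0; i < size; i++) { v[i] = i; }
}

/* Exchange entries k and pivot_index of the permutation vector v[],
 * recording that the pivot found at pivot_index is installed at stage k.
 */
void swap_entries(int v[], int k, int pivot_index)
{
    int tmp = v[k];

    v[k]           = v[pivot_index];
    v[pivot_index] = tmp;
}
```

Starting from {0, 1, 2, 3}, installing the pivot from position 2 at stage 0 gives {2, 1, 0, 3}; installing the pivot from position 3 at stage 1 then gives {2, 3, 0, 1}.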
#ifdef PROTOTYPE   /* ================================================== */


main(int argc, char *argv[])

#else

main(argc, argv)

int  argc;
char *argv[];

#endif
{

    /* ---------- ========== VARIABLE DEFINITIONS ========== ---------- */

    double a,               /* denominator of growth factor (g/a)     */
           *cbuf,           /* col buffer holds one col at a time     */
           **dtime,         /* doubles corresponding to ticks set     */
           eps = epsd(),    /* machine precision (see machine.h)      */
           g = 0.0,         /* the growth factor                      */
           root_time,       /* time measured at root for iterations   */
           tol;             /* tolerance                              */

    Double_Matrix_Type *A,  /* This A gets operated upon/changed      */
                       *A0; /* The original copy of A                 */

    int    cubesize,        /* number of processors in the cube       */
           dim,             /* dimension of the hypercube             */
           i,
           j,
           m,               /* number of rows in A                    */
           me,              /* root processor's id                    */
           n,               /* number of cols in A                    */
           *q,              /* row permutation vector                 */
           r,               /* numerical rank estimate                */
           timing,          /* Boolean                                */
           verbose;         /* Boolean                                */

    long   sizeof_col,      /* sizes, in bytes                        */
           sizeof_int,
           sizeof_pivot;

    ticks  root_start,
           t_root,          /* time measured at root transputer       */
           **t;             /* time data: row => node, col => event   */

    Pivot_Type pivot;       /* pivot                                  */


950 
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951  /* - =============  IIITIALIZATIOIS  ============= - ♦/ 


#ifdef TRANSPUTER

  /* Add 1M to the heap to allow for generation of large matrices */
  addfree((void *) _heapend, 0x100000);

#endif

  printf("\n%s\n\n", version);

  define_valid_args();

  resolve_args(argc, argv, &dim, &timing, &verbose);

  A = generate(&m, &n);

  sizeof_col   = (long) (A->rows * sizeof(double));
  sizeof_int   = (long) sizeof(int);
  sizeof_pivot = (long) sizeof(Pivot_Type);

  if (!(cbuf = (double *) malloc(sizeof_col))) {

    printf("main(): Allocation failure for cbuf[].\n");
    exit(EXIT_FAILURE);
  }

  cubesize = pow2(dim);

#ifdef TRANSPUTER

  initialize_hypercube(dim);

#else

  cubename = initialize_hypercube(dim, nodecode);

#endif


  me = myhost();


  if (verbose) {

    if (!(A0 = matalloc(m,n))) {

      printf("Allocation failure for A0.\n");
      exit(EXIT_FAILURE);
    }

    strcpy(A0->name, "Original A");

    for (i = 0; i < A->rows; i++) {
      for (j = 0; j < A->cols; j++) {

        A0->matrix[i][j] = A->matrix[i][j];
      }
    }
    printf("\n\nA has been allocated and generated.\n\n");
    printmd(*A, WIDTH, AFT);
    printf("\n\nSending size(A) to the nodes.\n\n");
  }


#ifdef TRANSPUTER

  cubecast(me, dim, (char *) &m, sizeof_int, cubesize);
  cubecast(me, dim, (char *) &n, sizeof_int, cubesize);
  cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else  /* iPSC/2 */

  cubecast(me, dim, (char *) &m, sizeof_int, ROW_SIZE_TYPE);
  cubecast(me, dim, (char *) &n, sizeof_int, COL_SIZE_TYPE);
  cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif

  if (verbose) printf("\nSent size(A) to nodes.\n");

  distribute_columns(A, dim, cbuf);

  q = initial_permutation_vector(n);


  /* FINAL PREPARATIONS BEFORE STARTING THE ITERATION -----------------
   *
   *  Get the first pivot from node 0.  Initialize the growth factor
   *  variables, g and a, so that we can compute growth factor (g/a) as
   *  we go.  Set a reasonable tolerance.
   *
   * ------------------------------------------------------------------
   */

#ifdef TRANSPUTER

  receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else  /* iPSC/2 */

  receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif  /* TRANSPUTER */


  a = g = MAX(g, fabs(pivot.u));

  tol = (MIN(m,n)) * g * eps;


  /* BEGINNING OF ITERATION -------------------------------------------
   *
   *  We enter with A established and knowledge of the first pivot.
   *
   * ------------------------------------------------------------------
   */

#ifdef TRANSPUTER

  root_start = clock();

#endif

  printf("Beginning iterations.\n\n");

  for (r = 0; r < (MIN(m,n)); r++) {

    if (pivot.id == RANK_DEFICIENT) break;

    /* We expect to receive cbuf[] in the correct (i.e., already
     * swapped) order.  Before we stuff cbuf[] into A[][], we'll swap
     * rows left of the pivot column, and then insert the new pivot
     * column.
     */

#ifdef TRANSPUTER

    receive(0, (char *) cbuf, sizeof_col, cubesize);

#else  /* iPSC/2 */

    receive(0, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif  /* TRANSPUTER */

    g = MAX(g, fabs(pivot.u));

    update_permutation(q, m, r, pivot.s);

    if (pivot.s != r) swap_rows_left_of_pivot(A, r, pivot.s);

    for (i = 0; i < A->rows; i++) { A->matrix[i][r] = cbuf[i]; }

    if (verbose) {

      printf("Host: Stage %d. Pivot value = %e.  ", r, pivot.u);
      printf("Growth factor = %e.\n", (g/a));
      printf("q = "); printvi(q, A->rows, WIDTH);
      printf("\n");

    }

    if (r < ((MIN(m,n)) - 1)) {

#ifdef TRANSPUTER

      receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else  /* iPSC/2 */

      receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif  /* TRANSPUTER */

    }

  }  /* end for(r) -------------------------------------------------- */


#ifdef TRANSPUTER

  t_root = (clock() - root_start);

  if (timing) {

    root_time = ((double) t_root) * LO_PERIOD;

    printf("\n\nRoot transputer:  ");
    printf("Time for iterations:  %8.4lf seconds\n\n", root_time);
  }

#endif


  free(cbuf);


  /* I have selected the easy way out and assumed A has full rank.  If
   * you did not make this assumption, you would need to collect the
   * remaining columns at this point.
   */

  if (timing) dtime = receive_timing_data(cubesize);


  /* There is no more use for the nodes, so they can be released. */

#ifndef TRANSPUTER
  printf("\n\nmain(): Killing and releasing cube.\n\n");
  killcube(ALL_NODES, ALL_PIDS);
  relcube(cubename);
#endif


  if (verbose) {  /* Create and show Q', A0, P, L, U .... */

    show_resulting_matrices(A, A0, q);

  }


  if (timing) display_timing_data(A, dim, a, eps, g, tol, r, dtime);

}
/* ========================== EOF gfpphost.c ========================== */
/* ---------------------- PROGRAM INFORMATION ------------------------
 *
 *  SOURCE    :  gfppnode.c
 *  VERSION   :  2.0
 *  DATE      :  21 September 1991
 *  AUTHOR    :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *  REMARKS   :  See gf.h.
 *
 * -------------------------------------------------------------------
 */

#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"
#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif


ticks t[MAX_EVENTS];


/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 *  This function is kind of an inverse for local_column().  Given some
 *  column number (local_column) held at this node, the function returns
 *  the corresponding column number in the global/host copy of the
 *  full-sized A.  This could be implemented more efficiently as a macro.
 */

#ifdef PROTOTYPE

int global_column(int local_column, int me, int cubesize)

#else

int global_column(local_column, me, cubesize)

int local_column,
    me,
    cubesize;

#endif
{
  return(local_column * cubesize + me);
}
/* End global_column() ------------------------------------------------ */


/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 *  This function maps a column number in the global A (the full-sized A
 *  held at the root processor/host) to the corresponding local column
 *  number.  If the global_column is not one that is held at this node, a
 *  negative value (-1) is returned.
 */

#ifdef PROTOTYPE

int local_column(int global_column, int me, int cubesize)

#else

int local_column(global_column, me, cubesize)

int global_column,
    me,
    cubesize;
#endif
{
  if ((global_column % cubesize) != me) return(-1);

  return((int) global_column / cubesize);
}
/* End local_column() ------------------------------------------------- */

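The two mapping functions above implement a wrap (cyclic) distribution of columns: global column gc lives on node gc % cubesize at local index gc / cubesize. The sketch below restates the pair with range checks and adds a closed-form count of the columns a node holds; the names `to_global`, `to_local`, and `columns_held` are hypothetical and are not part of the thesis code.

```c
#include <assert.h>

/* Global column index of local column lc on node me (wrap mapping). */
int to_global(int lc, int me, int cubesize)
{
  assert(cubesize > 0 && me >= 0 && me < cubesize && lc >= 0);
  return lc * cubesize + me;
}

/* Local index of global column gc on node me, or -1 if not held here. */
int to_local(int gc, int me, int cubesize)
{
  assert(cubesize > 0 && me >= 0 && me < cubesize && gc >= 0);
  if ((gc % cubesize) != me) return -1;
  return gc / cubesize;
}

/* Number of the n global columns (0..n-1) that node me holds. */
int columns_held(int n, int me, int cubesize)
{
  assert(cubesize > 0 && me >= 0 && me < cubesize && n >= 0);
  return (n > me) ? (n - me + cubesize - 1) / cubesize : 0;
}
```

For any column held at a node, `to_local(to_global(lc, me, p), me, p)` returns `lc`, which is the sense in which the two functions are inverses.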

/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 */

#ifdef PROTOTYPE

void do_pivot_column_arithmetic(Double_Matrix_Type *A, double *cbuf,
                                int k, int me, int cubesize)

#else

void do_pivot_column_arithmetic(A, cbuf, k, me, cubesize)

Double_Matrix_Type *A;
double             *cbuf;
int                 k,
                    me,
                    cubesize;

#endif
{
  double pivot_value;

  int    i,
         pivot_column;


  pivot_column = local_column(k, me, cubesize);

  pivot_value = A->matrix[k][pivot_column];


  /* Divide everything under the pivot by the pivot value */
  for (i = (k+1); i < A->rows; i++) {

    A->matrix[i][pivot_column] /= pivot_value;
  }


  /* This is somewhat redundant, and not optimal with respect to
   * efficiency, but it works and reads clearly, right?
   */

  for (i = 0; i < A->rows; i++) cbuf[i] = A->matrix[i][pivot_column];

}
/* End do_pivot_column_arithmetic() ----------------------------------- */


/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 *  This function accepts the matrix, the global column number for this
 *  stage (where the pivot will be taken from), and a pivot structure to be
 *  filled....among other things....and 'returns' the row, s, and value, u,
 *  of the new pivot in global column r (local column lc).
 */

#ifdef PROTOTYPE

void locate_pivot(int me, int cubesize, Double_Matrix_Type *A, int r,
                  Pivot_Type *pivot)

#else

void locate_pivot(me, cubesize, A, r, pivot)

int                 me,
                    cubesize;
Double_Matrix_Type *A;
int                 r;
Pivot_Type         *pivot;

#endif
{
  int i,
      pivot_column;


  pivot_column = local_column(r, me, cubesize);


  /* Initialize pivot row and value */
  pivot->s = r;
  pivot->u = A->matrix[r][pivot_column];


  for (i = (r+1); i < A->rows; i++) {

    if (fabs(A->matrix[i][pivot_column]) > fabs(pivot->u)) {
      pivot->s = i;
      pivot->u = A->matrix[i][pivot_column];
    }
  }
}
/* End locate_pivot() ------------------------------------------------- */
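The search in locate_pivot() is ordinary partial pivoting: scan the pivot column from the diagonal downward and keep the entry of largest magnitude. A minimal sketch on a plain array follows, assuming the column has already been copied into `col[]`; the function name `pivot_row` is hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Row index of the largest-magnitude entry in col[r..rows-1].  Because
 * the comparison is a strict '>', ties keep the first (smallest) row,
 * which matches the behaviour of the loop in locate_pivot(). */
int pivot_row(const double col[], int rows, int r)
{
  int i, s = r;

  assert(r >= 0 && r < rows);
  for (i = r + 1; i < rows; i++) {
    if (fabs(col[i]) > fabs(col[s])) s = i;
  }
  return s;
}
```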


/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 *  Receive this node's columns from the root/host processor (manager),
 *  place them into the column buffer, then transfer them into A while
 *  the other processors are communicating with the root.
 *
 *  The transputer scheme is a bit more involved.  Here nodes 0000 and 1000
 *  are connected to the root and they must receive for everyone.  They (0
 *  and 8) are not directly connected to everyone, so the columns must be
 *  passed out in cycles.  For instance, suppose we used the hybrid 4-cube.
 *  Then nodes 0 and 8 would receive bursts of 8 columns at a time.  They
 *  would keep the first one (we'll call it column 0 in some sort of
 *  relative numbering scheme that abides by the C numbering convention),
 *  send the next one (col 1) in the 0x1 direction, the next to the 0x2
 *  direction, column 3 in the 0x1 direction, column 4 in the 0x4 direction,
 *  column 5 in the 0x1 direction, column 6 in the 0x2 direction, and
 *  lastly, column 7 in the 0x1 direction.  This makes cycle == 8 for nodes
 *  0000 and 1000.  Similarly, nodes x001 have a cycle of four where they
 *  keep the first column to arrive and then send the next three to
 *  directions 0x2, 0x4, and 0x2 in turn.  This distribution pattern is
 *  maintained until all of the columns have been distributed.
 */

#ifdef PROTOTYPE

void receive_columns(int                 dim,
                     int                 node,
                     Double_Matrix_Type *A,
                     int                 n,
                     double             *cbuf,
                     int                 my_cols,
                     int                 colsize)
#else

void receive_columns(dim, node, A, n, cbuf, my_cols, colsize)

int                 dim,
                    node;
Double_Matrix_Type *A;
int                 n;
double             *cbuf;
int                 my_cols,
                    colsize;

#endif
{
  int cubesize = pow2(dim),
      cycle,                    /* length of typical col burst    */
      dimeff   = MIN(3, dim),   /* effective dimension            */
      from,                     /* node that I receive from       */
      gc,                       /* global column index            */
      i,
      idx,                      /* index into to[]                */
      lc       = 0,             /* local column index             */
      ldeff,                    /* effective least_dimension()    */
      nodeff   = (node % 8),    /* effective node number          */
      others,                   /* no. of nodes in other 3-cube   */
      step,                     /* for destination of cols rec'd  */
      thehost  = myhost(),
      to[8];                    /* ==> direction to send to       */


#ifdef TRANSPUTER

  ldeff = least_dimension(nodeff);

  if (nodeff == 0) from = myhost();
  else             from = node ^ pow2(ldeff - 1);

  /* cycle describes the length of a cycle that starts with me (node)...
   * then I receive several columns for others....then start over with
   * me.  The nodes in the highest dimension have cycle == 1 ==> self
   * only.  We also fill to[] with the directions that we will be
   * sending to within a given cycle.  Not all nodes use all 8 elements
   * of to[].  They only use the first cycle elements.  The step is the
   * difference between the column numbers received at this node during
   * a given burst of length cycle.
   *
   * When we use the hybrid 4-cube, we are treating it as two 3-cubes,
   * so the variable others is set to 8.  This is because there are 8
   * other columns between every burst that comes to the 3-cube that
   * node is in.
   */
  cycle = pow2(dimeff - ldeff);

  (dim == 4) ? (others = 8) : (others = 0);

  step = pow2(ldeff);

  to[0] = 0;
  to[1] = to[3] = to[5] = to[7] = pow2(ldeff);
  to[2] = to[6] = pow2(ldeff + 1);
  to[4] = pow2(ldeff + 2);


  for (gc = node; gc < n; gc += (others + step)) {

    receive(from, (char *) cbuf, colsize, cubesize);

    for (i = 0; i < A->rows; i++) A->matrix[i][lc] = cbuf[i];

    lc++;

    for (idx = 1; idx < cycle; idx++) {

      gc += step;

      if (gc < n) {

        receive(from, (char *) cbuf, colsize, cubesize);

        directional_send(node, dim, to[idx], (char *) cbuf, colsize);
      }
    }

  }  /* end for(gc) */


#else  /* iPSC/2 */

  for (lc = 0; lc < my_cols; lc++) {

    receive(thehost, (char *) cbuf, colsize, COL_TYPE);

    for (i = 0; i < A->rows; i++) { A->matrix[i][lc] = cbuf[i]; }
  }

#endif  /* TRANSPUTER */

}
/* End receive_columns() ---------------------------------------------- */
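The direction table used in receive_columns() can be checked in isolation. The sketch below rebuilds to[] and the cycle length from the effective dimension and the effective least dimension, assuming (as the comment's worked example implies) that least_dimension() yields 0 for nodes x000 and 1 for nodes x001. The function name `fill_directions` is hypothetical, and `1 << e` stands in for `pow2(e)`.

```c
#include <assert.h>

/* Fill to[0..7] with the send directions used within one burst by a node
 * whose effective least dimension is ldeff, and return the cycle length
 * for an effective cube dimension of dimeff (at most 3). */
int fill_directions(int to[8], int dimeff, int ldeff)
{
  assert(ldeff >= 0 && ldeff <= dimeff && dimeff <= 3);

  to[0] = 0;                                   /* column kept locally */
  to[1] = to[3] = to[5] = to[7] = 1 << ldeff;
  to[2] = to[6] = 1 << (ldeff + 1);
  to[4] = 1 << (ldeff + 2);

  return 1 << (dimeff - ldeff);                /* cycle               */
}
```

Only the first `cycle` entries of `to[]` are meaningful for a given node; the remaining entries are filled but never used.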


/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 *  This function sends in the timing data that is held in t[].
 */

#ifdef PROTOTYPE

void submit_timing_data(int node, int dim)
#else

void submit_timing_data(node, dim)

int node,
    dim;

#endif
{
  int  dimeff   = MIN(dim, 3),
       dir,
       i,
       ld       = least_dimension(node % 8),
       nodeff   = (node % 8),
       root     = myhost();

  long cubesize = pow2(dim),
       tlen;

  tlen = (long) (MAX_EVENTS * sizeof(ticks));

#ifdef TRANSPUTER

  submit(node, dim, (char *) t, tlen, cubesize);

  if (dimeff == ld) return;

  if ((nodeff == 2) || (nodeff == 3)) {

    if (dimeff > 2) {

      directional_receive(node, dim, 0x4, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
    }

    return;
  }

  if (nodeff == 1) {

    if (dimeff > 1) {

      directional_receive(node, dim, 0x2, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
    }

    if (dimeff > 2) {

      directional_receive(node, dim, 0x4, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
      directional_receive(node, dim, 0x2, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
    }

    return;
  }

  if (nodeff == 0) {

    if (dimeff > 0) {

      /* retrans from 1 or 9 */
      directional_receive(node, dim, 0x1, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
    }

    if (dimeff > 1) {

      /* retrans from 2 or 10 */
      directional_receive(node, dim, 0x2, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
      /* retrans from 3 or 11 */
      directional_receive(node, dim, 0x1, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
    }

    if (dimeff > 2) {

      /* retrans from 4 or 12 */
      directional_receive(node, dim, 0x4, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
      /* retrans from 5 or 13 */
      directional_receive(node, dim, 0x1, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);
      /* retrans from 6 or 14 */
      directional_receive(node, dim, 0x2, (char *) t, tlen);
      submit(node, dim, (char *) t, tlen, cubesize);

/* [Listing lines 451-500 are missing from this copy: the remainder of
 *  submit_timing_data() and the beginning of update_G().] */

      lc = 0;

  ticks start;


  while ((gc = global_column(lc, me, cubesize)) <= k) lc++;


  /* The pivot row is k and we know that lc is the first local column to
   * the right of k.  Now we must move through the Gauss Transform area,
   * all A(i,j) where i > k and j > k, and perform the operation:
   *
   *   A(i,j) = A(i,j) - A(i,k) * A(k,j)  <==>  A(i,j) -= cbuf[i]*A(k,j)
   */

  start = clock();

  for (i = k+1; i < A->rows; i++) {

    for (j = lc; j < A->cols; j++) {

      A->matrix[i][j] -= (cbuf[i] * A->matrix[k][j]);

    }  /* end for(j) */

  }  /* end for(i) */

  t[LOOPTIME] += (clock() - start);

}
/* End update_G() ----------------------------------------------------- */

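The update performed in update_G() is the usual rank-one elimination step: every entry to the right of and below the pivot has the product of its row multiplier and the pivot-row entry subtracted from it. A minimal sketch on a flat, row-major array is given below; the function name `gauss_update` and the flat storage layout are assumptions made for illustration and differ from the Double_Matrix_Type used in the listing.

```c
#include <assert.h>

/* Apply A(i,j) -= mult[i] * A(k,j) for all i > k and j >= first_col,
 * where A is a rows x cols matrix stored row-major in the array a and
 * mult[] plays the role of cbuf[] (the scaled pivot column). */
void gauss_update(double *a, int rows, int cols, int k, int first_col,
                  const double mult[])
{
  int i, j;

  assert(k >= 0 && k < rows && first_col >= 0 && first_col <= cols);
  for (i = k + 1; i < rows; i++) {
    for (j = first_col; j < cols; j++) {
      a[i * cols + j] -= mult[i] * a[k * cols + j];
    }
  }
}
```

Here `first_col` corresponds to lc in the listing, the first local column lying to the right of the pivot column.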

/* ===================================================================== */

main(){

  double *cbuf;             /* column buffer holds one col of A      */

  Double_Matrix_Type *A;    /* this node's portion of the matrix A   */

  int  cubesize,            /* number of processors in the cube      */
       dim,                 /* dimension of the hypercube            */
       gc,                  /* global column number                  */
       i,                   /* generic integer and row ctr           */
       j,                   /* generic integer and col ctr           */
       k,                   /* index to pivot                        */
       m,                   /* number of rows in A (same local/all)  */
       me,                  /* id of this processor                  */
       my_cols = 0,         /* number of cols in local portion of A  */
       n,                   /* number of cols in all of A            */
       root,                /* host/root processor id                */
       timing;              /* Boolean                               */

  long sizeof_col,          /* sizes, in bytes                       */
       sizeof_int,
       sizeof_pivot;

  ticks start,
        start1;             /* another start                         */

  Pivot_Type pivot;



  /* -------------------- INITIALIZATION WORK -------------------- */

  for (i = 0; i < MAX_EVENTS; i++) t[i] = 0;

  start = t[START_TIME] = clock();


#ifdef TRANSPUTER

  cubesize = CUBESIZE;
  dim      = DIMENSION;
  initialize_hypercube(dim);

#else

  cubesize = (int) numnodes();
  dim      = (int) nodedim();

#endif

  t[DATA_SOURCE] = me = (int) mynode();
  root = (int) myhost();

  sizeof_int   = (long) sizeof(int);
  sizeof_pivot = (long) sizeof(Pivot_Type);



  /* BROADCAST THE SIZE(A) --------------------------------------------
   *
   *  All node processors need to know the number of rows and columns in
   *  the matrix A [i.e., size(A)].  A broadcast to the entire cube,
   *  cubecast(), is used to achieve this.  The nodes also need to know
   *  whether or not to set timing on, so this value is passed too.
   *
   */

#ifdef TRANSPUTER

  cubecast(me, dim, (char *) &m, sizeof_int, cubesize);
  cubecast(me, dim, (char *) &n, sizeof_int, cubesize);
  cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else  /* iPSC/2 */

  cubecast(me, dim, (char *) &m, sizeof_int, ROW_SIZE_TYPE);
  cubecast(me, dim, (char *) &n, sizeof_int, COL_SIZE_TYPE);
  cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif  /* TRANSPUTER */

  sizeof_col = (long) (m * sizeof(double));



  /* COLUMN BUFFER AND COUNTER ----------------------------------------
   *
   *  The column buffer, cbuf[], will be used to hold one column of A at
   *  a time.  We will see cbuf[] used on a variety of occasions when we
   *  must work with a column of A.  Allocate cbuf[] and determine the
   *  number of columns that will be stored locally (my_cols).
   *
   */
  cbuf = (double *) malloc(sizeof_col);

  for (i = 0; i < n; i++) { if ((i % cubesize) == me) my_cols++; }


  /* ESTABLISH LOCAL A ------------------------------------------------
   *
   *  Allocate storage space for this node's part of A (it is called A
   *  even though it is only part of A).
   */

  A = matalloc(m, my_cols);

  t[SETUP] = clock() - start;

  start = clock();

  receive_columns(dim, me, A, n, cbuf, my_cols, sizeof_col);

  t[DISTRIB_COLS] = clock() - start;



  /* BEGIN ITERATION --------------------------------------------------
   *
   *  1.)  At the top of the for() loop we have just completed update_G(),
   *       so the local candidate for the next pivot is situated in np[0].
   *       The function elect_next_pivot() performs a series of
   *       directional_exchange()s so that all local candidates compete in
   *       an election process.  The winner is np[0].
   *
   *  2.)  If all went well, np[0] contains the next pivot.  This
   *       information ...
   *
   *  3.)  If this node has the pivot column [if (p[k] == gc)], it must
   *       divide everything under the pivot by the value of the pivot and
   *       distribute the column to all other nodes (node zero sends to
   *       host).
   *
   *  4.)  Finally, this node must perform the computations across the
   *       Gauss Transform area for the local portion of A.  The
   *       update_G() function also locates the next pivot without special
   *       expense.  Then it is time to go back to the top of the loop.
   */

  start = clock();

  for (k = 0; k < (MIN(m,n)); k++) {

    pivot.id = k % cubesize;
    pivot.t  = k;

    /* know id; k ==> t; need s, u */

    if (pivot.id == me) locate_pivot(me, cubesize, A, k, &pivot);

    cubecast_from(pivot.id, me, dim, (char *) &pivot, sizeof_pivot);

    if (me == 0) {

      start1 = clock();

#ifdef TRANSPUTER

      send(root, (char *) &pivot, sizeof_pivot, cubesize);

#else  /* iPSC/2 */

      send(root, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif  /* TRANSPUTER */

      t[PIVOTS_TO_HOST] += (clock() - start1);
    }

    swap_rows(A, k, pivot.s);

    start1 = clock();

    if (pivot.id == me) {

      do_pivot_column_arithmetic(A, cbuf, k, me, cubesize);
    }

    t[PCOL_ARITHMETIC] += (clock() - start1);

    start1 = clock();

    cubecast_from(pivot.id, me, dim, (char *) cbuf, sizeof_col);

    t[PCOL_DISTRIB] += (clock() - start1);


    if (me == 0) {

      start1 = clock();

#ifdef TRANSPUTER

      submit(me, dim, (char *) cbuf, sizeof_col, cubesize);

#else  /* iPSC/2 */

      submit(me, dim, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif  /* TRANSPUTER */

      t[PCOLS_TO_HOST] += (clock() - start1);
    }

    start1 = clock();
    update_G(A, cbuf, cubesize, k, me, m, &pivot);
    t[UPDATING_G] += (clock() - start1);

  }
  /* END ITERATION [for(k...)] ---------------------------------------- */

  t[ITERATION] = clock() - start;


  free(cbuf);

  t[STOP] = clock();

  if (timing) submit_timing_data(me, dim);

  return(SUCCESS);
}
/* ========================== EOF gfppnode.c ========================== */
/* ---------------------- PROGRAM INFORMATION ------------------------
 *
 *  SOURCE    :  gfpcnode.c
 *  VERSION   :  2.3
 *  DATE      :  17 September 1991
 *  AUTHOR    :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *  REMARKS   :  See gf.h.
 *
 * -------------------------------------------------------------------
 */

#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"
#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif


ticks t[MAX_EVENTS];

/* ----------------------- FUNCTION DEFINITION -----------------------
 *
 *  After this node finds its candidate for next pivot, there must be a
 *  comparison with all other nodes.  The local candidate starts in np[0].
 *  Direction-by-direction, candidates are exchanged and the winner is
 *  positioned in np[0].  If there is a tie, the candidate from the smaller
 *  node number wins.  A RANK_DEFICIENT opponent is ignored (the local
 *  candidate must be at least as good).  In the end, all processors have
 *  identical entries in np[0].
 */

#ifdef PROTOTYPE

void elect_next_pivot(int me, int dim, Pivot_Type *np)

#else

void elect_next_pivot(me, dim, np)

int         me,
            dim;
Pivot_Type *np;

#endif
{
  int  dir;

  long cubesize = pow2(dim),
       len      = sizeof(Pivot_Type);


  for (dir = 1; dir < (int) cubesize; dir <<= 1) {

    if (dir != 8) {

      directional_exchange(me, dim, dir, (char *) &(np[1]),
                                         (char *) &(np[0]), len);
    }
    else {

      if ((me % 8) != 0) {    /* we don't want 0 <--> 8 comm */

        directional_exchange(me, dim, dir, (char *) &(np[1]),
                                           (char *) &(np[0]), len);
      }
    }

    if (np[1].id != RANK_DEFICIENT) {

      if (fabs(np[1].u) > fabs(np[0].u)) {

        np[0].id = np[1].id;   np[0].u = np[1].u;
        np[0].s  = np[1].s;    np[0].t = np[1].t;
      }
      else {

        if (fabs(np[1].u) == fabs(np[0].u)) {

          if (np[1].id < np[0].id) {    /* smallest breaks tie */

            np[0].id = np[1].id;   np[0].u = np[1].u;
            np[0].s  = np[1].s;    np[0].t = np[1].t;
          }
        }
      }
    }  /* end if(np[1].id....) */
  }  /* end for(dir) */


  /* Since there is no direct connection between nodes 0 and 8, we once
   * again destroy the beauty and generality of the hypercube so that we
   * can be sure that 0 and 8 have the best candidate for pivot.
   */

  if (dim == 4) {

    if ((me % 8) == 0) {    /* Nodes 0000 and 1000 */

      directional_receive(me, dim, 0x1, (char *) np, len);
    }

    if ((me % 8) == 1) {    /* Nodes 0001 and 1001 */

      directional_send(me, dim, 0x1, (char *) np, len);
    }
  }
}
/* End elect_next_pivot() --------------------------------------------- */


/* This is only the first part of this file.  The rest would be similar to
 * gfppnode.c
 *
 * EOF gfpcnode.c
 */
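The comparison applied at each exchange in elect_next_pivot() can be stated as a single predicate: a valid opponent wins if its magnitude is strictly larger, or if the magnitudes are equal and it comes from the smaller node number. The sketch below isolates that rule. `Candidate` is a hypothetical, reduced form of Pivot_Type (only id and u), `NO_CANDIDATE` is a hypothetical stand-in for RANK_DEFICIENT, and `opponent_wins` is not a function in the thesis code.

```c
#include <assert.h>
#include <math.h>

#define NO_CANDIDATE (-1)   /* hypothetical stand-in for RANK_DEFICIENT */

typedef struct {
  int    id;                /* node that proposed the candidate         */
  double u;                 /* candidate pivot value                    */
} Candidate;                /* hypothetical, reduced Pivot_Type         */

/* Return 1 if the opponent should replace the local candidate. */
int opponent_wins(const Candidate *local, const Candidate *opp)
{
  assert(local != 0 && opp != 0);

  if (opp->id == NO_CANDIDATE) return 0;            /* ignore opponent  */
  if (fabs(opp->u) > fabs(local->u)) return 1;      /* larger magnitude */
  if (fabs(opp->u) == fabs(local->u) && opp->id < local->id) return 1;
  return 0;
}
```

Because both partners of an exchange evaluate the same rule on the same pair of candidates, they agree on the winner, which is why every node ends the election with an identical np[0].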
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